to JPEG Acceleration

- Dr. Konrad Froitzheim, Heiner Wolf
- Department of Distributed Systems
- University of Ulm, Germany
- Oberer Eselsberg
- D 89069 Ulm / Germany
- internet: frz@informatik.uni-ulm.de , wolf@informatik.uni-ulm.de

**Abstract**

*JPEG picture compression and related algorithms are not only used in still
picture compression, but also to a growing degree for moving picture
compression in telecommunication applications. Real-time JPEG compression and
decompression are crucial in these scenarios. We present a method to
significantly improve the performance of software based JPEG decompression. Key
to these performance gains are adequate knowledge of the structure of the JPEG
coded picture information and transfer of structural information between
consecutive processing steps. Our implementation achieved an 80% performance
increase decompressing typical JPEG video streams.*

*Keywords:*

*Video, compression, decompression, JPEG, DCT, telecommunications*

JPEG's 'lossy' baseline process is based on the Discrete Cosine Transform (DCT). It packs the transformed data in sequential order, using zero suppression in combination with Huffman symbol coding. We will focus on the IDCT related steps to show where our improvements apply.

**Figure 1**: JPEG block diagram

**Figure 2**: FDCT example, only the upper left corner contains significant numbers

The 2-dimensional DCT and its inverse are orthogonal transforms. A DCT can be expressed as matrix multiplications

FDCT: C = Q · P · Q^T

where P denotes the color component block, Q the transformation matrix, and C the coefficient matrix.
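The matrix formulation above can be sketched in a few lines of Python. This is an illustrative reference implementation under the assumption of an orthonormal DCT-II matrix Q, not the paper's optimized DSP code; the names `fdct` and `idct` are ours.

```python
import math

N = 8
# Orthonormal DCT-II matrix Q: entry (k, j) is c_k * cos((2j+1) * k * pi / 2N),
# with c_0 = sqrt(1/N) and c_k = sqrt(2/N) for k > 0.
Q = [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * j + 1) * k * math.pi / (2 * N))
      for j in range(N)] for k in range(N)]

def matmul(A, B):
    """Plain 8x8 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def fdct(P):
    """FDCT as the matrix product C = Q * P * Q^T."""
    return matmul(matmul(Q, P), transpose(Q))

def idct(C):
    """IDCT as the matrix product P = Q^T * C * Q."""
    return matmul(matmul(transpose(Q), C), Q)
```

A constant block transforms into a single DC coefficient, and `idct(fdct(P))` recovers the original block, illustrating the orthogonality of the transform.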

The **quantization** prepares the blocks for efficient coding. Each DCT
coefficient block is divided by an 8x8 quantization matrix. Typical
quantization matrices used in telecommunication applications turn small
coefficients and coefficients representing finer detail into zeros. The
compression rate can be adjusted by changing the quantization matrix. Figure 3
shows the sample block after quantization.

**Figure 3**: Quantization: most coefficients are eliminated

The quantized array is then linearized in a zigzag fashion, zeros are suppressed, and the result is further reduced in size by an appropriate Huffman encoder. For further detail see [ISO]. In our example we achieve good compression: only 7 code symbols are required to represent the 64 coefficients:

(10) (0, -2) (-1) (-1) (-1) (0, 0, -1) (End_of_Block)
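The zigzag linearization and zero-run coding can be sketched as follows. The `(run, value)` tuples are illustrative symbols, not the actual Huffman codes, and the sample block in the test is our own; it does not reproduce the paper's figure.

```python
def zigzag_order(n=8):
    """JPEG zigzag scan order as a list of (row, col) index pairs:
    diagonals of constant row+col, alternating in direction."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def run_length_symbols(block):
    """Linearize a quantized 8x8 block in zigzag order and code zero runs.
    Output: DC symbol, (zero_run, value) pairs, End_of_Block marker."""
    zz = [block[i][j] for i, j in zigzag_order()]
    symbols = [("DC", zz[0])]
    nonzero = [idx for idx, v in enumerate(zz) if v != 0]
    last = nonzero[-1] if nonzero else 0
    run = 0
    for v in zz[1:last + 1]:
        if v == 0:
            run += 1
        else:
            symbols.append((run, v))
            run = 0
    symbols.append("EOB")           # trailing zeros collapse into one symbol
    return symbols
```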

The inverse DCT (**IDCT**) then computes a pixel block P from the
coefficient matrix C. This step is again a matrix multiplication with the DCT
matrix Q:

IDCT:

The **color space conversion** back to RGB is the final step in the
decompression process.

The computationally relevant steps in this process are:

- color space conversion,
- FDCT/IDCT, and
- Huffman coding.
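The final color space conversion above can be sketched with the JFIF-style YCbCr-to-RGB formulas (the paper speaks of YUV; the exact coefficients of its implementation are not given, so the JFIF constants here are an assumption):

```python
def ycbcr_to_rgb(y, cb, cr):
    """JFIF-style YCbCr to RGB conversion with clamping to 0..255.
    Chrominance values are centered around 128."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    clamp = lambda v: max(0, min(255, int(round(v))))
    return clamp(r), clamp(g), clamp(b)
```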

A two dimensional DCT, like any separable transform, can be implemented as two successive sets of 1D-DCTs [RaoYip]. Many different algorithms for 1D-DCTs have been proposed over the last 15 years. Chen's algorithm, based on sparse matrix factorization [ChenHarFra], reduced the number of arithmetic operations to 16 multiplications and 26 additions for an 8-point 1D-DCT. Others gave algorithms using even fewer multiplications [Wang] [Lee]. The theoretical lower bound for the number of multiplications is 11 [Duhamel]. [LoeLigMo] presented a class of such DCT algorithms.

In addition to the two consecutive series of 1D-DCTs, more sophisticated methods based on true 2D decompositions have been published [KamRao] [ChoYunLee] [Vetterli]. They decrease the number of multiplications by a factor of 2 but increase the number of additions slightly.

With the arrival of advanced processor architectures, the determining factors for the computational cost have changed. An increasing number of general purpose processors are designed for very effective floating-point calculations. Some of them even provide single-cycle multiply-accumulate instructions (MAC). Such processors are best served by algorithms with a well balanced mix of multiplications and additions. Approaches exploiting special instructions such as the one-step MAC [LinFeig] focus on reducing the instruction count rather than the number of multiplications.

Other DCT based compression schemes use frequency filtering [ThoBlinHias] to guarantee zero coefficients in well defined areas of the matrix, thus enabling tailored algorithms that avoid calculations on coefficients of suppressed frequencies. The JPEG baseline algorithm as used in most telecommunication applications, however, does not produce well defined 'empty' areas in the coefficient matrices. We propose a fully JPEG-compatible decompression method based on dynamically generated knowledge about the matrix occupancy, without explicit frequency filtering in the compression step.

We decided on a parallel implementation with functional decomposition, where the 68040 parses the data stream, converts from YUV to RGB, and moves the data into the display RAM, because the DSP does not have direct access to the frame buffer. The DSP performs matrix reconstruction (Huffman) and IDCT steps. We chose Chen's 1-dimensional DCT scheme - which might be somewhat slower than recent DCT algorithms - as a basis for our implementation. But it proved advantageous for the DSP-typical multiply/accumulate architecture: it maintains a good balance between multiplications and additions.

Initially we achieved decompression rates ranging from 6 to 8 frames per second for 256x192 frames. In-depth analysis of the compression scheme and typical data configurations led to algorithmic improvements resulting in an 80% higher frame rate.

Two properties of the coefficient matrices are exploited:

- the **sparse population** of the 8x8 IDCT input matrices, and
- the distribution of non-zero coefficients in these matrices, in other words the **shape** of the non-zero parts.

The analyzed image sequences are:

- News woman (474 frames, TV); mostly filled with the face of a woman reading the news. The background is mildly colored and blurred.
- News anchor (860 frames, TV); a sharper background with more detail. The background color is dark green and brown. Head and upper body of the anchor are visible.
- Office (1000 frames, high quality video camera); one of the authors in the office working at his workstation with bookshelves in the background. The background is bright and colored.
- Movie (1000 frames, VCR); Clint Eastwood having breakfast with another person in a bright room in the middle of Africa. An example for conferencing with more than one participant per part.
- Text (1000 frames, high quality video camera); a dictionary page with a lot of detail and only a few colors. During a conference, the video is sometimes used to show printed material such as charts.

**Figure 4**: One frame of the 'News anchor' sequence.

|            | News woman | News anchor | Office | Movie  | Text   |
|------------|------------|-------------|--------|--------|--------|
| Y-plane    | 7.8 %      | 9.7 %       | 8.0 %  | 12.6 % | 22.4 % |
| U/V-planes | 2.2 %      | 2.3 %       | 2.4 %  | 2.1 %  | 1.5 %  |

**Table 1:** Percentage of non-zero coefficients in reconstructed matrices.
Note that the chrominance planes contain less high frequency components
resulting in fewer non-zero coefficients.

Figure 5 visualizes the distribution of coefficients in the matrices for luminance (Y) and chrominance (U/V) planes. As expected most of the non-zero entries are found in the low frequency coefficient area.

**Figure 5**: Position of non-zero coefficients in reconstructed matrices.
Data from 1000 frames; image sequence 'Office'.

**Figure 6**: Percentage of matrices with a maximum of i, 1 <= i <= 8
occupied columns for the Y plane and U/V planes.

The graphs for the rows look similar. Note that the only image sequence with many fully occupied matrices is 'Text'. The reason is that JPEG is not very effective for images full of tiny detail, for example characters. They produce lots of high frequency components in the luminance plane.

**Figure 7**: Distribution of occupancy shapes in compressed image
sequences. Separated in vertically dominated (left), symmetric (middle) and
horizontally dominated (right) matrices.

The next figure (8) shows examples of typical distributions of non-zero coefficients.

**Figure 8**: Examples: Typical matrices show asymmetric distributions of
non-zero coefficients. Dark fields hold values unequal to zero.

The asymmetric characteristic of coefficient matrices can be exploited by adaptive transform coding methods to achieve better compression ratios [HoGersho]. For the case of the JPEG baseline algorithm, knowledge about the shape of non-zero parts in coefficient matrices can be used to improve decompression speed.

Numerical mathematicians have developed effective techniques for sparsely populated matrices. They typically exploit geometric properties of the non-empty regions such as band or triangular shape. Other important preconditions for specialized algorithms are certain mathematical properties of the matrices, e.g. regularity, det(M) != 0, or positive definiteness [Locher]. As shown above, IDCT input matrices typically have empty columns or rows (=> det(C) = 0) and they are not positive definite. They are not band-shaped or triangular in the mathematical sense either (see figure 8).

The idea of utilizing the sparsity of coefficient matrices for speed-up has been presented before [HungMeng]. This work focused on reducing the number of multiplications needed for the IDCT: zero values in coefficient matrices are identified and counted in order to select the best of several proposed IDCT algorithms. The authors neither explain how to obtain structural information efficiently, nor do they optimize for processor architectures with multiply-accumulate instructions. Efficient extraction of information about coefficient matrices and its exploitation in terms of instruction count is important and not trivial. Recent microprocessors need about the same instruction count for a calculation with zero as for a test against zero. Selective execution following tests against zero needs even more instructions. Our objective is to avoid touching coefficients equal to zero altogether, not merely to suppress calculations with zero values.

Many input matrices for the IDCT have only one non-zero coefficient in the
upper left corner (the DC value). Special treatment of this case, which avoids
the IDCT entirely, achieved a speedup of 30%. This special case is simple to
implement; a generic algorithm, however, needs to know dynamically *where* it
can skip operations.

The structural information gained in the matrix reconstruction is then used to control the adaptive IDCT algorithm. It is important to note that structural information already available in the matrix reconstruction phase is recorded for further use in the IDCT phase.
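The recording step can be sketched as follows: while coefficients are placed back into the 8x8 matrix during de-zigzagging, row and column occupancy bitmasks are updated essentially for free. The bitmask representation and function names are our illustration; the paper's DSP implementation differs.

```python
# Precomputed zigzag scan: index -> (row, col).
ZIGZAG = sorted(((i, j) for i in range(8) for j in range(8)),
                key=lambda p: (p[0] + p[1],
                               p[0] if (p[0] + p[1]) % 2 else -p[0]))

def reconstruct_with_occupancy(coeffs):
    """Place (zigzag_index, value) pairs into an 8x8 matrix while
    recording which rows/columns receive non-zero values as bitmasks."""
    block = [[0] * 8 for _ in range(8)]
    row_mask = col_mask = 0
    for idx, value in coeffs:
        r, c = ZIGZAG[idx]
        block[r][c] = value
        if value != 0:
            row_mask |= 1 << r      # bit r set: row r is occupied
            col_mask |= 1 << c      # bit c set: column c is occupied
    return block, row_mask, col_mask
```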

v_j = 1/2 · Σ_{k=0..7} C(k) · u_k · cos((2j+1)·k·π/16), with C(0) = 1/√2 and C(k) = 1 for k > 0,

where u_k are the IDCT coefficients and v_j are the output values of the transformation.

In the first step a 1-dimensional transform is applied to each column, the second step is a loop that performs the 1-dimensional transform on each row (figure 9).

**Figure 9:** Separation of the 2D-DCT into two 1-dimensional transforms.
The first series results in a semi-transformed matrix. The second series
completes the 2D transform.
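The separation into two series of 1D transforms can be verified directly: two passes of the 1D-IDCT (columns, then rows) give the same result as the full 2D sum. This is an illustrative check in Python, not the paper's implementation.

```python
import math

CN = [math.sqrt(0.5)] + [1.0] * 7   # normalization factors C(k)

def idct_1d(u):
    """Regular 8-point 1D-IDCT."""
    return [0.5 * sum(CN[k] * u[k] * math.cos((2 * j + 1) * k * math.pi / 16)
                      for k in range(8)) for j in range(8)]

def idct_2d_separable(F):
    """First series: 1D-IDCT on every column; second series: on every row."""
    semi = [[0.0] * 8 for _ in range(8)]
    for c in range(8):
        col = idct_1d([F[k][c] for k in range(8)])
        for j in range(8):
            semi[j][c] = col[j]
    return [idct_1d(row) for row in semi]

def idct_2d_direct(F):
    """Direct 2D-IDCT (double sum) for comparison."""
    return [[0.25 * sum(CN[u] * CN[v] * F[u][v]
                        * math.cos((2 * x + 1) * u * math.pi / 16)
                        * math.cos((2 * y + 1) * v * math.pi / 16)
                        for u in range(8) for v in range(8))
             for y in range(8)] for x in range(8)]
```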

**Figure 10**: Transformation of non-zero columns

v_j = 1/2 · Σ_{k=0..r-1} C(k) · u_k · cos((2j+1)·k·π/16)

where again u_k are the IDCT coefficients and v_j are the output values of the transformation. Provided the coefficients u_r, ..., u_7 are zero, this reduced IDCT produces the same results as the regular IDCT over 8 coefficients with fewer calculations. A reduced IDCT of 8th degree is identical to the regular 8-point 1D-IDCT. Table 2 shows the number of instructions for different degrees of the reduced transformation.

| degree       | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
|--------------|----|----|----|----|----|----|----|----|
| instructions | 12 | 27 | 33 | 37 | 42 | 44 | 46 | 48 |

**Table 2**: Instruction count for the reduced 1-dimensional IDCT of r-th
degree. The lower degrees show by far the best reduction of the instruction
count.
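The reduced 1D-IDCT simply truncates the sum after the first r coefficients. A minimal sketch (not instruction-count optimized like the DSP version in the paper):

```python
import math

CN = [math.sqrt(0.5)] + [1.0] * 7   # normalization factors C(k)

def idct_1d_reduced(u, r):
    """Reduced 1D-IDCT of r-th degree: only the first r coefficients
    enter the sum. Identical to the full 8-point IDCT when u[r:] are zero."""
    return [0.5 * sum(CN[k] * u[k] * math.cos((2 * j + 1) * k * math.pi / 16)
                      for k in range(r))
            for j in range(8)]
```

Calling it with r = 8 gives the regular 8-point 1D-IDCT; for a row with only 3 leading non-zero coefficients, degree 3 suffices.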

In the second step after the column transformation a reduced IDCT of r-th degree is applied to a row containing r non-zero elements. Figure 11 gives an example with a reduced IDCT of 5th degree in the second step. The thicker part of the arrows on the right side shows the input values used.

**Figure 11**: Reduced row transform

Note that we did not tailor our algorithm to ignore individual zeros. The administrative overhead would certainly defeat the purpose of such an action.

**Figure 12a/b**: Distribution of non-zero entries in the semi-transformed
matrix after the first series of 1D-IDCTs. The number of non-zero entries
depends on the direction of the 1D-IDCT series.

The IDCT involves two matrix multiplications. They are of course associative:

P = (Q^T · C) · Q = Q^T · (C · Q)

This associativity enables the use of the occupation bitset to choose the best first step, i.e. perform the first transform in the direction that fills the matrix the least:

- if #columns > #rows then transform rows first,
- if #columns <= #rows then transform columns first.
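The decision rule reduces to comparing the population counts of the two occupancy bitmasks. A minimal sketch, assuming the bitmask representation described earlier (the string return values are our illustration):

```python
def popcount(mask):
    """Number of set bits, i.e. number of occupied rows or columns."""
    return bin(mask).count("1")

def first_pass_direction(row_mask, col_mask):
    """Pick the first 1D-IDCT series so the semi-transformed matrix stays
    as sparse as possible: transform along the less-occupied direction."""
    if popcount(col_mask) > popcount(row_mask):
        return "rows_first"
    return "columns_first"
```

For the example in the text (6 occupied columns, 3 occupied rows) the rule selects the row transforms first, leaving only 24 of 64 fields non-zero after the first series.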

Figure 12a shows a typical block of coefficients with more occupied columns than rows. Transformations on the 6 occupied columns result in 48 of 64 fields with non-zero values. The second step then requires 8 reduced transformations of 6th degree over the rows. In contrast, if the transforms of the first step are applied to the 3 occupied rows we get only 24 of 64 fields with non-zero values. In this case the second step requires reduced transforms of only 3rd degree (figure 12b).

| frame rate [frames/sec] | basic algorithm | basic +DConly | occupied columns | +reduced IDCT | best row/column | best row/column +DConly |
|-------------|-----|------|-----|------|------|------------|
| News Woman  | 7.6 | 9.9  | 9.7 | 11.8 | 12.6 | 14.1 (86%) |
| News Anchor | 7.3 | 9.0  | 9.2 | 10.9 | 11.5 | 12.5 (70%) |
| Office      | 7.5 | 10.8 | 9.6 | 12.0 | 12.6 | 14.5 (92%) |
| Movie       | 7.0 | 8.9  | 8.6 | 9.9  | 10.6 | 11.5 (65%) |
| Text        | 6.1 | 7.9  | 7.0 | 8.0  | 8.1  | 8.8 (45%)  |

**Table 3**: Decompression frame rates for the different versions of the algorithm.

The subjective experience was verified experimentally in a test scenario measuring the performance of the different versions of the decompression algorithm.

The 'basic algorithm' uses the standard method of two series of Chen-IDCTs without other algorithmic improvements. However, the IDCT kernel is hand-optimized and the code is the same as used in the best version (Table 3, last column). The 'occupied columns' version uses the proposed fast information extraction mechanism in the Huffman step. The publicly available JPEG code version 5 of the Independent JPEG Group (IJG) performs in between the 'basic algorithm' and the 'occupied columns' version. It avoids the IDCT on columns containing only zero coefficients, but the decision whether to skip an IDCT or execute it takes significant time on processors with multiply-accumulate architectures.

Table 3 gives the performance in frames per second, figure 13 shows the time for the IDCT. The percentages in the last column reflect the improvement compared to the basic algorithm. The total overhead to compute and use the structural information is about 9% of the overall decompression time in our implementation.

**Figure 13**: Time for IDCT computation with different acceleration
techniques (256x192 pixel per frame).

Our algorithm almost doubles the decompression frame rate for typical telecommunication video streams. We expect that this result and other algorithmic advances, for example in Huffman decoding, will pave the way for software based real time JPEG or MPEG processing in every workstation.

[ChoYunLee] Nam Ik Cho, Il Dong Yun, Sang Uk Lee: A Fast Algorithm for 2D-DCT; ICASSP '91; 1991; page 2197.

[Duhamel] P. Duhamel, H. H'Mida: New 2^{n} DCT Algorithms suitable for VLSI Implementation; Proceedings IEEE International Conference Acoustics, Speech and Signal Processing ICASSP-87, Dallas, Apr. 1987, page 1805.

[HoGersho] Y.S. Ho, A. Gersho: Classified transform coding of images using vector quantization; ICASSP '89, Glasgow, May 1989; page 1890.

[HungMeng] Andy C. Hung, Teresa H.-Y. Meng: Statistical Inverse Discrete Cosine Transforms for Image Compression; Proceedings SPIE vol. 2187, page 196.

[ISO] International Organization for Standardization: Information Technology - Digital Compression and Coding of Continuous-tone Still Images; ISO/IEC DIS 10918-1; ISO 1991.

[KamRao] F.A. Kamangar, K.R. Rao: Fast Algorithms for the 2-D Discrete Cosine Transform; IEEE Transactions on Computers, Vol. C-31, No. 9, Sept. 1982, page 899.

[Lee] Byeong Lee: A New Algorithm to Compute the Discrete Cosine Transform; IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, Dec. 1984, page 1243.

[LinFeig] E. Linzer, E. Feig: New Scaled DCT Algorithms for Fused Multiply/Add Architectures; IEEE Intl. Conference on Acoustics, Speech, and Signal Processing '91; 1991, page 2201.

[Locher] F. Locher: Numerische Mathematik für Informatiker; Berlin, 1992.

[LoeLigMo] Ch. Loeffler, A.Ligtenberg, G.S. Moschytz: Practical Fast 1-D DCT Algorithms with 11 Multiplications; IEEE Intl. Conference on Acoustics, Speech, and Signal Processing '89; 1989, page 988.

[MaChSwWo] L. Matterne, D. Chong, B. McSweeney, R. Woudsma: A flexible high performance 2-D discrete cosine transform IC; ISCAS 89, International Symposium Circuits and Systems; May 1989, page 618.

[RaoYip] K.R. Rao, P. Yip: Discrete Cosine Transform, Academic Press, 1990.

[ThoBlinHias] V. Thomas, J.L. Blin, M. Hias: Experiments on visibility thresholds of DCT Coefficients; Picture Coding Symposium 88; Italy, Sept 1988; page 3.1-1.

[Vetterli] M. Vetterli: Fast 2-D Discrete Cosine Transform; ICASSP '85; 1985, page 1538.

[Wang] Z. Wang: Fast Algorithms for the Discrete W-Transform and for the Discrete Fourier Transform; IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 4, Aug. 1984, page 803.