
JPEG Series, Part I: Visualizing the Inverse Discrete Cosine Transform

A key step in JPEG image compression is converting 8-by-8-pixel blocks of color values into the frequency domain, so instead of storing color values, we store amplitudes of sinusoidal waveforms. This is a fun little bit of applied math, and you might enjoy seeing how it works.

It all really started in the early 1820s, when Joseph Fourier figured out that any periodic waveform can be broken down into a sum of sinusoids. Kalid Azad has explained this much better than I could over at BetterExplained, and if you are not familiar with the Fourier transform, I recommend you go learn about it from Kalid first. I'll be waiting right here.


Welcome back! Let's apply what you learned to a block of pixel values. How about this block of pixels right here?

Take a horizontal slice, 8 pixels wide, from that block. Take the position of each pixel as an x value and its brightness as a y value. Then the 8 pixels correspond to 8 (x, y) points on a plane, and we could find a combination of sinusoidal waves that would go through those 8 points.

We could, and we will. Let's do it now. Click on any row of the pixel grid below:

The heavy, black waveform is the sum of all the colored sinusoids. The heights of the 8 black dots represent the brightness values of the 8 pixels in the selected row. Try comparing several different rows to see that in each case, the height of the dots matches the brightness of the pixels.

You might notice that for darker pixel values, the dots appear below the 'zero line', while for brighter pixels, they appear above it. This is because we subtracted 128 from each brightness value before converting to a waveform, so the range of possible values (0-255) would be centered on the zero line. This is also done when an image is stored in JPEG format.
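In code, that level shift is just a subtraction. Here is a minimal sketch (the eight values in `pixels` are made up purely for illustration):

// Center eight brightness values (0-255) on zero by subtracting 128,
// as is done before the transform when an image is stored as JPEG.
const pixels = [52, 55, 61, 66, 70, 61, 64, 73];   // made-up example values
const pixelValues = pixels.map(p => p - 128);       // e.g. 52 becomes -76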

You might have also noticed that one of the component waves doesn't look like a sinusoid; it's the red one. It is just a flat line. That is the zero frequency component; it represents the average of the 8 values. It shifts the black waveform up or down to just the right height for it to hit all 8 target points.

The legend displays the frequency and amplitude (on a scale of zero to one) of each component wave. Try clicking on the color swatches in the legend to see the component waves more clearly.

Of course, we could do exactly the same with columns of 8 pixels:

Now, you have seen that each row or column of pixel values in this "coffee cup" icon can be converted to a sum of sinusoidal waves. But could that just be a fluke? Can we really do this with any sequence of eight brightness values?

If the answer is obvious, just humour me here. Go back and try dragging any of the black points up or down. The pixel colors and waveform will update as you drag.

Looks cool, doesn't it?


Now... we need to talk.

I have misled you here. The link to BetterExplained above probably tricked you into thinking that these waveforms were derived using a Fourier transform. Not so. This page is about the Discrete Cosine Transform, not the Fourier transform. But it was good for you to understand the idea of the Fourier transform before learning about the DCT.

If you take some time to play with both the Fourier transform demonstration on BetterExplained and the DCT demonstration here, you might recognize some differences between these transforms. Even when given the same input, they produce a different series of component waves. (That's an interesting point; the same sequence of discrete time samples can be broken down into sinusoids in more than one way.)

Do you want to go back and give it a try? Either way, whenever you are ready, click to reveal two major differences:

Other differences which you can't see from the demonstrations are:

Now, another important point. Look back at the interactive DCT. As you drag the target points up and down, what is the maximum number of component waves which are required to match the 8 target points?

Right. That is a key point for JPEG image compression. Remember, in a JPEG image, blocks of 8 pixels by 8 pixels are represented as a combination of component waves. The amplitude of each component is called a DCT coefficient. Since each block of input pixels converts to a fixed number of component waves, we just need to store a fixed number of coefficients for each such block. We don't need to store their frequencies, since those are known and are always the same. Nor do we need to store phase shifts, because every component always has the same fixed phase.

Let's move into two dimensions now. We have demonstrated that if we just needed to represent 8 color values in a row, 8 coefficients would be enough. But how many coefficients do we need to represent 8-by-8, or 64, color values? Guess before clicking:

You might ask: Isn't JPEG a lossy format which compresses images into fewer bits? How can converting 64 numbers into 64 other numbers save storage space?

You are right; by itself, applying the DCT to a block of color samples doesn't result in any compression. It's like translating English text to French or Chinese; you are representing the same information in a different form. So it's not surprising that 64 color samples convert to 64 coefficients. However, the DCT is still a key step in achieving image compression. More on this later.

I want to show you examples of 8-by-8 images broken down into 64 two-dimensional waveforms. As in the one-dimensional case, the frequencies and phases of the 64 components will always be the same; only their amplitudes will vary. Before looking at any sample images, though, first let me show you the 64 components of the two-dimensional DCT at a fixed amplitude.

On the left is an 8 by 8 grid. In each cell is a (u, v) pair. (We talk about positions in a DCT coefficient matrix using u, v coordinates; coordinates in the corresponding block of pixels are named x and y.) On the right is a waveform graph. Click on each position in the coefficient matrix to see what the corresponding waveform looks like. You can click and drag on the graph to pivot.

Do you see how the 64 components in the two-dimensional case correspond to the 8 components in the one-dimensional case? Look again at the waveform for coefficient (0, 0), in the top-left corner. This one is very important, since it gives the average value of all 64 color samples in a block. It is called the DC coefficient. The other 63 coefficients represent all the deviations from the average and are called the AC coefficients.
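If you would like to compute one of these (u, v) waveforms yourself, here is a minimal sketch. (The function name `basisValue` is my own, and the normalization factors are left out, since only the shape of the wave matters here.)

// One sample of the 2D waveform for coefficient (u, v), evaluated at pixel
// position (x, y). Each 2D wave is simply the product of a horizontal and
// a vertical 1D cosine wave.
function basisValue(u, v, x, y) {
  return Math.cos(((2 * x) + 1) * Math.PI * u / 16) *
         Math.cos(((2 * y) + 1) * Math.PI * v / 16);
}

// For example, an 8-by-8 table of samples of the (1, 0) waveform:
const pattern = [];
for (var y = 0; y < 8; y++) {
  const row = [];
  for (var x = 0; x < 8; x++)
    row.push(basisValue(1, 0, x, y));
  pattern.push(row);
}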

Now you are ready to see the Inverse Discrete Cosine Transform at work. Click on each of the sample images below to see its DCT coefficients. Click on any coefficient to disable it and see what the image looks like with it removed; or Control-click to enable only a single coefficient and see its contribution to the image.

Here's a hint of one interesting thing to look for. If you try disabling various coefficients in the images, which coefficients generally seem to have a smaller effect on the picture? (Those at the top of the matrix, the left, right, bottom, or towards a certain corner?) This has much to do with JPEG compression. More on that towards the end of the post...


If you've made it this far, you might want to know how the DCT and IDCT are calculated. (Don't care about the math? Feel free to skip it.) For simplicity, we'll stick to eight pixel values in one dimension (a row or column). The two-dimensional transforms are very similar.

First, the DCT, which converts pixel values to coefficients:

$$G_u = \frac{1}{2}\, C(u) \sum_{x=0}^{7} g_x \cos\left[\frac{(2x + 1)\, u\, \pi}{16}\right]$$

where $g_x$ is the value of the pixel at position $x$, and

$$C(u) = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } u = 0 \\ 1 & \text{otherwise} \end{cases}$$

That's quite a mouthful in English: to calculate the DCT coefficient u, loop over all eight pixels and sum up: the pixel value times the cosine of: twice the pixel's x coordinate plus one, times π, times u, divided by 16. Halve the total. Further, if u is zero, divide the total again by root 2.

Or if you speak JavaScript:

// Forward DCT in one dimension: convert 8 pixel values into 8 coefficients.
// pixelValues holds the 8 input samples (brightness values with 128 subtracted).
const coefficients = [];
for (var u = 0; u < 8; u++) {
  var coeff = 0;

  // Sum each pixel value, weighted by the u-th cosine wave at that pixel's position
  for (var x = 0; x < 8; x++)
    coeff += pixelValues[x] * Math.cos(((2 * x) + 1) * Math.PI * u / 16);

  coeff /= 2;

  // The zero-frequency coefficient gets an extra factor of 1/√2
  if (u === 0)
    coeff /= Math.sqrt(2);

  coefficients.push(coeff);
}

Note that (2x + 1)π/16 ranges from just above zero to just below π; that is, half a complete cycle for the cosine function. So when u is one, the component wave only makes half a cycle as x moves from zero to seven. If u is two, then the input to the cosine function increments twice as fast, and a full cycle is made. Each increment of u increases the frequency of the component wave by 0.5Hz.

On the other hand, when u is zero, the cosine always evaluates to one, and we are essentially just summing up the eight pixel values.

Essentially, coefficient u expresses how well the eight values "fit" a (0.5u)Hz cosine wave. Each value which is positive where the cosine wave is positive (or negative where it is negative) increases the coefficient. Each value which is of opposite sign to the cosine wave at its location on the x-axis decreases the coefficient. If the coefficient value is very positive, that means the sequence of samples closely fits a cosine wave; or if the coefficient value is very negative, that means the samples are close to the opposite of a cosine wave (that is, a cosine wave shifted by 180 degrees).
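If you want to check this numerically, here is a small experiment of my own. (It simply repackages the loop above into a function; the name `dct1d` is mine, not something from the JPEG spec.)

// The DCT loop from above, wrapped in a function for easy experimentation.
function dct1d(pixelValues) {
  const coefficients = [];
  for (var u = 0; u < 8; u++) {
    var coeff = 0;
    for (var x = 0; x < 8; x++)
      coeff += pixelValues[x] * Math.cos(((2 * x) + 1) * Math.PI * u / 16);
    coeff /= 2;
    if (u === 0)
      coeff /= Math.sqrt(2);
    coefficients.push(coeff);
  }
  return coefficients;
}

// Eight samples taken straight from a 0.5Hz cosine wave...
const samples = [];
for (var x = 0; x < 8; x++)
  samples.push(Math.cos(((2 * x) + 1) * Math.PI / 16));

// ...produce coefficients of approximately [0, 2, 0, 0, 0, 0, 0, 0]:
// only the 0.5Hz component is present, and its coefficient is strongly positive.
console.log(dct1d(samples));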

The Inverse DCT is very similar. Shall we stick to JavaScript this time?

// Inverse DCT in one dimension: convert 8 coefficients back into 8 pixel values.
const pixelValues = [];
for (var x = 0; x < 8; x++) {
  // Start from the zero-frequency component, which carries its 1/√2 factor
  var value = coefficients[0] / Math.sqrt(2);

  // Add the contribution of each higher-frequency cosine wave at this pixel's position
  for (var u = 1; u < 8; u++)
    value += coefficients[u] * Math.cos(((2 * x) + 1) * Math.PI * u / 16);

  value /= 2;

  pixelValues.push(value);
}
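To confirm that the two transforms really are inverses, you can wrap this loop in a function too and feed it the coefficients from the earlier cosine experiment. (Again, the name `idct1d` is my own.)

// The Inverse DCT loop from above, wrapped in a function.
function idct1d(coefficients) {
  const pixelValues = [];
  for (var x = 0; x < 8; x++) {
    var value = coefficients[0] / Math.sqrt(2);
    for (var u = 1; u < 8; u++)
      value += coefficients[u] * Math.cos(((2 * x) + 1) * Math.PI * u / 16);
    value /= 2;
    pixelValues.push(value);
  }
  return pixelValues;
}

// Coefficients of [0, 2, 0, 0, 0, 0, 0, 0] come back as the original eight
// samples of a 0.5Hz cosine wave (up to floating-point rounding error).
console.log(idct1d([0, 2, 0, 0, 0, 0, 0, 0]));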

Let me share another thing which I find fascinating about the DCT. Actually, no; let me show you and see if you can recognize it yourself.

Earlier I showed you waveform graphs demonstrating how the DCT converts a sequence of discrete color samples to a sum of sinusoid waves. The graphs were bounded tightly around the eight samples on the x-axis. This time let's stretch out the x-axis and let the waves carry on to the left and right. I will draw grey dots at evenly spaced intervals, so you can see if the same pattern repeats itself every eight time units or not.

Look carefully at the pattern created when the waves derived from the DCT are extended to the left and right. Is it simply repeating the same pattern every eight time units? Or...what?

This is part of why the DCT is useful for image compression. When an image is broken down into blocks of pixels, the color of the left edge of a block will often be different from the right edge, and likewise for the top and bottom edges. If we used a discrete Fourier transform on those color samples, the discontinuity between the colors of opposing edges would tend to produce strong high-frequency component waves. (When breaking a waveform down into sinusoids, any sharp "jumps" result in strong high-frequency components.) But since the DCT, in effect, butts the block up against a mirror image of itself on each side, that discontinuity doesn't exist, and the high-frequency components will usually be much weaker.
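Here is a small way to see that mirroring in code: evaluate the sum of cosines at positions outside the 0-7 range and compare. (This is a sketch of my own; the function name `waveformAt` is mine, and `coefficients` is assumed to hold the 8 DCT coefficients of a block, as computed earlier.)

// Evaluate the sum-of-cosines waveform at any x position, including
// positions outside the 8-pixel block.
function waveformAt(coefficients, x) {
  var value = coefficients[0] / Math.sqrt(2);
  for (var u = 1; u < 8; u++)
    value += coefficients[u] * Math.cos(((2 * x) + 1) * Math.PI * u / 16);
  return value / 2;
}

// Stepping off the left edge mirrors the values just inside it:
// waveformAt(coefficients, -1) equals waveformAt(coefficients, 0),
// waveformAt(coefficients, -2) equals waveformAt(coefficients, 1), and so on.
// The extension is a reflection of the block, not a simple repetition of it.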


I still haven't told you why the DCT is useful for image compression. First, understand that not all the information contained in an image is equally important or noticeable to a human viewer. It happens that converting color samples to the frequency domain concentrates the information which is most detectable by our visual system in the coefficients at the top-left of the DCT matrix. Conversely, the information which is least perceptible to our visual system is concentrated in the coefficients at the bottom-right.

In this way, the DCT sets things up for subsequent stages of compression to work their fullest effect. First, quantization. This stage throws away part of the data in the less-significant bits of the DCT coefficients. Since we know that the values of the coefficients towards the bottom-right have less of an effect on what we see, those can be heavily quantized, while retaining more bits of the coefficients toward the top-left. That means we can discard a significant amount of data with little effect on visual quality.
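As a rough sketch of how that works: each coefficient is divided by an entry in a quantization table and rounded to the nearest integer. The table below is made up purely for illustration; real JPEG encoders use carefully designed tables, scaled according to the chosen quality setting, and store the table used in the file.

// A made-up quantization table: small divisors at the top-left keep those
// coefficients precise; large divisors at the bottom-right throw more away.
const quantTable = [];
for (var v = 0; v < 8; v++) {
  quantTable.push([]);
  for (var u = 0; u < 8; u++)
    quantTable[v].push(4 + (4 * (u + v)));    // 4 at (0,0), up to 60 at (7,7)
}

// coefficientMatrix is assumed to be an 8x8 array of DCT coefficients.
function quantize(coefficientMatrix) {
  return coefficientMatrix.map((row, v) =>
    row.map((coeff, u) => Math.round(coeff / quantTable[v][u])));
}

// A decoder multiplies back; whatever was lost to rounding stays lost.
function dequantize(quantized) {
  return quantized.map((row, v) =>
    row.map((q, u) => q * quantTable[v][u]));
}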

The DCT works synergistically with quantization and the zig-zag ordering of coefficients to make the final entropy coding stage more effective. This stage applies a lossless compression algorithm to the quantized coefficients.

Many images will have smaller coefficient values toward the bottom right of the matrix, and after quantization is applied, these may become zeroes. So the zig-zag ordering of coefficients will tend to produce runs of zeroes toward the end of each block. Those consecutive zeroes can then be represented using an efficient run-length encoding.
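Here is a minimal sketch of both ideas together, continuing from the quantization sketch above. (The zig-zag scan order itself is part of the JPEG standard, but the encoding below is simplified; real JPEG entropy coding combines these run lengths with Huffman coding.)

// Build the zig-zag scan order by walking the anti-diagonals of the 8x8
// matrix, alternating direction, so coefficients are visited roughly from
// the top-left (lowest frequencies) to the bottom-right (highest).
const zigzagOrder = [];
for (var s = 0; s <= 14; s++) {              // s = u + v, the anti-diagonal
  for (var i = 0; i <= s; i++) {
    const v = (s % 2 === 0) ? s - i : i;
    const u = s - v;
    if (u < 8 && v < 8)
      zigzagOrder.push([v, u]);              // [row, column]
  }
}

// Read quantized coefficients in zig-zag order, collapsing runs of zeroes.
// `quantized` is assumed to be the 8x8 array from the quantization sketch.
function runLengthEncode(quantized) {
  const output = [];
  var zeroes = 0;
  for (const [v, u] of zigzagOrder) {
    const value = quantized[v][u];
    if (value === 0) {
      zeroes++;
    } else {
      output.push([zeroes, value]);          // (run of zeroes, nonzero value)
      zeroes = 0;
    }
  }
  output.push('EOB');                        // "end of block": only zeroes remain
  return output;
}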

Interestingly, both the WebP and AVIF compressed image formats also transform color samples into the frequency domain. Both can use either the Discrete Cosine Transform or a different transform which serves a similar purpose.

The other popular compressed image formats are PNG and GIF. Neither of these transform samples into the frequency domain.

The next post in this series will explore Huffman coding, a lossless compression algorithm which is another key ingredient of JPEG.

This post left you burning to speak your mind? Reach out and e-mail the author! Selected replies will be published here.