JPEG Series, Part II: Huffman Coding
The previous article in this series explored how JPEG compression converts pixel values to DCT coefficients. A later stage of the compression process uses either a method called "Huffman coding" or another called "arithmetic coding" to store those coefficients in a compact manner. The Huffman coding algorithm is very simple, but powerful and widely used. If you’ve never learned how it works, I promise this will be interesting.
You have some data to compress. Your data can be viewed as a sequence of values; perhaps Unicode codepoints, pixel color samples, audio amplitude samples, or something comparable. In the context of Huffman coding, each of these values is called a "symbol".
We are going to encode each symbol using a unique, variable-length string of bits. For example, if each symbol is a letter, the letter "a" could be "10111", "b" could be "00011", and so on. We can pick any series of bits we like to represent each symbol, with the restriction that after all the symbols are converted to bitstrings, and all those little bitstrings are smashed together, it must be possible to figure out what the original sequence of symbols was.
That means the bitstrings for each symbol must be prefix-free; none of them can be a prefix of another. Example of a non-prefix-free code: say we choose "1100" to represent the letter "a", "11" to represent "b", and "00" for "c". When decoding text, we find "1100" somewhere. How are we supposed to know whether that was originally a letter "a", or the two letters "bc"? It’s impossible to tell, precisely because "11" is a prefix of "1100". Fortunately, Huffman codes are always prefix-free.
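To make the prefix-free property concrete, here is a small sketch (with a made-up code table) of how a decoder can read bits greedily: as soon as the bits seen so far match a table entry, that entry must be the symbol, because no other bitstring starts with it.

```js
// Sketch: decoding with a hypothetical prefix-free code table.
const code = { a: "0", b: "10", c: "110", d: "111" }; // prefix-free
const decodeTable = Object.fromEntries(
  Object.entries(code).map(([symbol, bits]) => [bits, symbol])
);

function decode(bits) {
  let result = "";
  let current = "";
  for (const bit of bits) {
    current += bit;
    if (current in decodeTable) { // unambiguous: nothing else starts with `current`
      result += decodeTable[current];
      current = "";
    }
  }
  return result;
}

console.log(decode("110010111")); // "cabd"
```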
Let’s encode this sentence with such a code. Click the button below, and the computer will generate a random prefix-free code. Try clicking a number of times and see what the smallest total number of bits required to encode the sentence appears to be.
Obviously, the number of bits required to encode a sequence can vary wildly depending on the chosen code. Can you see why some prefix-free codes are more efficient than others? What is the key difference between an efficient code and an inefficient one? (Click to reveal.)
Out of the vast number of prefix-free codes which could be used, we want to find an optimal one; one which will encode our particular data in the smallest number of bits. (There will always be many optimal codes for any sequence, but we just need to find one of them.) At first, it might appear that we need to try millions of possible codes to be sure that we have an optimal one. Fortunately, that is not the case. Just count how many times each symbol appears in the input data, and in an almost trivially simple way, you can find an optimal prefix-free code. It will be easy, and fun!
I could tell you the algorithm right now, but it will be so much more enjoyable to discover it yourself. So I’ll take this slow, and reason towards a solution step by deliberate step. If at any point you catch the scent of the solution, stop and think it out before continuing.
The first step is to represent a prefix-free code as a binary tree. Have a look:
Please make sure that you clearly see the correspondence between coding tables and binary trees before continuing.
We can see that the number of bits used to encode each symbol equals the number of links between its tree node and the root node, also called the "depth" of the node. Leaf nodes which are closer to the root (smaller depth) have shorter bitstrings.
We will add a weight to each leaf node, which is simply the number of times its symbol appears in the input data:
Now the total length in bits of the compressed output will equal weight times depth, summed over all the leaf nodes.
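Spelled out as a tiny sketch (assuming each leaf is an object with hypothetical `weight` and `depth` fields):

```js
// Total compressed size in bits: sum of weight * depth over all leaves.
const totalBits = (leaves) =>
  leaves.reduce((sum, leaf) => sum + leaf.weight * leaf.depth, 0);

// e.g. one symbol appearing 20 times at depth 2, another appearing once at depth 5:
console.log(totalBits([{ weight: 20, depth: 2 }, { weight: 1, depth: 5 }])); // 45
```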
So now our goal is to find a binary tree which minimizes the sum of weight times depth. We don’t really have an idea how to do that, though. At least we do know what the leaf nodes of the tree should be:
How are we going to find the right structure for the internal nodes? Well, we could try to do it top-down, meaning we figure out what child nodes the root node should have, then the nodes below those, and so on. Or we could work bottom-up, meaning we figure out which leaf nodes should become children of the same parent node, then find which parent nodes should be children of the same "grandparent" node, until the whole tree is joined together. A third option would be to work both up and down from the middle, but that is just as hopeless as it sounds. These animations may help you understand "top-down" and "bottom-up" tree construction:
Bottom-up
Top-down
To build an optimal tree top-down, we would need a way to partition the symbols into two subsets, such that the total weights of each subset are as close as possible to 50%-50%. That might be tricky. On the other hand, if we can come up with a simple criterion to identify two leaf nodes which should be siblings in the tree, we might be able to apply the same criterion repeatedly to build an optimal tree bottom-up. That sounds more promising.
Before we consider that further, take note of an important fact. How many internal nodes, including the root, does it take to connect N leaf nodes together into a binary tree? Watch the above animations again and try to figure it out:
Good. Another one: When building a tree bottom-up, every time we pick two subtrees and join them together as children of a new internal node, what happens to the depth of all the leaves in the combined subtree?

Remember that the depth of each leaf node equals the number of bits required to encode the corresponding symbol. So every time we join two subtrees, we are in a sense "lengthening" the bitstrings for all the symbols in the new subtree. Since we want the most common symbols to have the shortest bitstrings (equivalent: we want their nodes to be closest to the root), they should be the last ones to be joined into the tree.
With that in mind, can you now see what the first step in building an optimal tree bottom-up should be?
Yes! Just like this:
Just another small conceptual leap, and the complete solution will be ours. Here’s what we need to figure out: Just now, we took the two lowest-weighted leaf nodes and joined them together. But how should we "weight" the resulting subtree? How will we know when and where to join it into a bigger subtree? More concretely: for the second step, we could either take our new 3-node subtree, and use it as one child of a new 5-node subtree, or we could pick two of the remaining single nodes, and join them into another 3-node subtree. How do we decide which choice is better?
.
.
.
Think about it this way. When we attach a single node into a subtree, the bitstring representation for its symbol is being "lengthened" by one bit. In a sense, it’s like the total bit length of the final encoded message is being increased by the weight of the node.
When we attach a subtree into a bigger subtree, the same thing happens to all the leaf nodes in the subtree. All of their bitstrings are growing by one bit, so the final encoded message size is growing by the sum of their weights.
That was a giveaway if there ever was one. So answer now, how should we weight subtrees which contain multiple leaf nodes?
And then what is our algorithm for building an optimal tree bottom-up?
Type some text in the entry field below, and I’ll animate the process for you:
Yes, that tree represents an optimal prefix-free code!
That can’t be hard to code, can it? (It’s not.) One thing, though: Since the symbol set might be large, we need a data structure which allows quick retrieval of the two lowest-weighted subtrees at each step. A minheap fits the bill perfectly. Here’s a minimal implementation using JavaScript Arrays:
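Something along these lines would do (a sketch: the `MinHeap` name and its `push`/`pop`/`size` interface are choices made here for illustration):

```js
// Sketch: a binary min-heap stored in an Array, ordered by each item's `weight`.
class MinHeap {
  constructor() { this.items = []; }

  get size() { return this.items.length; }

  push(item) {
    const items = this.items;
    items.push(item);
    // Sift the new item up until its parent is no heavier.
    let i = items.length - 1;
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (items[parent].weight <= items[i].weight) break;
      [items[parent], items[i]] = [items[i], items[parent]];
      i = parent;
    }
  }

  pop() {
    const items = this.items;
    const top = items[0];
    const last = items.pop();
    if (items.length > 0) {
      items[0] = last;
      // Sift the moved item down until neither child is lighter.
      let i = 0;
      while (true) {
        const left = 2 * i + 1;
        const right = 2 * i + 2;
        let smallest = i;
        if (left < items.length && items[left].weight < items[smallest].weight) smallest = left;
        if (right < items.length && items[right].weight < items[smallest].weight) smallest = right;
        if (smallest === i) break;
        [items[i], items[smallest]] = [items[smallest], items[i]];
        i = smallest;
      }
    }
    return top;
  }
}
```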
It would be fun to animate the minheap operations and show you how they work, but that would have to be a different article.
The rest of the code to build Huffman trees is almost anticlimactic:
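A sketch of that remainder, reusing the `MinHeap` above (the node shapes and function names are illustrative): count symbol frequencies, repeatedly join the two lightest subtrees, then read the bitstrings off the finished tree.

```js
// Sketch: build a Huffman tree from a string.
// Leaf nodes are { symbol, weight }; internal nodes are { weight, left, right }.
function buildHuffmanTree(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);

  const heap = new MinHeap();
  for (const [symbol, weight] of counts) heap.push({ symbol, weight });

  // Join the two lowest-weighted subtrees under a new parent whose weight
  // is the sum of its children's weights, until one tree remains.
  while (heap.size > 1) {
    const left = heap.pop();
    const right = heap.pop();
    heap.push({ weight: left.weight + right.weight, left, right });
  }
  return heap.pop(); // the root
}

// Walk the tree to read off the bitstring for each symbol ("0" = left, "1" = right).
function codeTable(node, prefix = "", table = {}) {
  if (node.symbol !== undefined) {
    table[node.symbol] = prefix || "0"; // degenerate case: only one distinct symbol
  } else {
    codeTable(node.left, prefix + "0", table);
    codeTable(node.right, prefix + "1", table);
  }
  return table;
}

console.log(codeTable(buildHuffmanTree("this sentence is an example")));
```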
Modifying the Basic Algorithm for JPEG
The Huffman codes generated above have two important differences from those used to compress pixel data in JPEG files.
Difference #1: JPEG Huffman tables never use bitstrings which are composed of only 1’s. "111" is out. "1111" is forbidden. And you can just forget about "111111".
BUT WHY? Because while sections of Huffman-coded data in a JPEG file must always occupy a whole number of 8-bit bytes, all those variable-length bitstrings will not necessarily add up to a multiple of 8 bits. If there are some extra bits left to fill in the last byte, "1" bits are used as padding. If bitstrings composed of only 1’s were used, the padding in the last byte could be mistakenly decoded as an extraneous trailing symbol. By avoiding such bitstrings, it is always possible to recognize the padding.
How can we modify our algorithm to account for that? Can you think of an idea?
That just takes a few more lines of code:
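One way to do it, sketched below by reusing the `MinHeap` and `codeTable` from above: add a dummy leaf with weight 0, and whenever a join involves the subtree holding the dummy, put that subtree on the "1" side. The dummy then owns the all-1s bitstring, so no real symbol can.

```js
// Sketch: Huffman tree construction with a zero-weight dummy leaf.
const DUMMY = Symbol("dummy");

function buildJpegStyleTree(text) {
  const counts = new Map([[DUMMY, 0]]); // the dummy never appears in the input
  for (const ch of text) counts.set(ch, (counts.get(ch) || 0) + 1);

  const heap = new MinHeap();
  for (const [symbol, weight] of counts) {
    heap.push({ symbol, weight, hasDummy: symbol === DUMMY });
  }

  while (heap.size > 1) {
    let left = heap.pop();
    let right = heap.pop();
    // Keep the dummy-containing subtree on the "1" (right) side of every join.
    if (left.hasDummy) [left, right] = [right, left];
    heap.push({
      weight: left.weight + right.weight,
      hasDummy: left.hasDummy || right.hasDummy,
      left,
      right,
    });
  }
  return heap.pop();
}
```

When the codes are read off this tree with `codeTable`, the dummy’s entry is simply discarded from the resulting table; its all-1s bitstring goes unused.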
This is optimal tree construction with a dummy node:
Difference #2: JPEG Huffman codes are always canonical.
In a canonical Huffman code, when the bitstrings are read as binary numbers, shorter bitstrings are always smaller numbers. For example, such a code could not use both "000" and "10", since the former bitstring is longer, but is a smaller binary number. Further, when all the bitstrings used in the code are sorted by their numeric value, each successive bitstring increments by the smallest amount possible while remaining prefix-free. Here’s an example, courtesy of Wikipedia:
| Bitstring |
| --- |
| 0 |
| 10 |
| 110 |
| 111 |
Interpreted as numbers, those are zero, two, six, and seven. Why wasn’t the second bitstring "01", or one? Because then the first would have been its prefix. Likewise, if the third was "011" (three), "100" (four), or "101" (five), in each case one of the first two would have been a prefix. For the fourth, incrementing by one to "111" didn’t create a prefix, so "111" it is. (Hopefully that example gives you the idea; hit me up if you need more!)
But WHY does JPEG use canonical codes? Because their coding tables can be represented in a very compact way[1], which makes our JPEG files smaller and faster to decode. (Yes, JPEG files must contain not just Huffman-encoded pixel data but also the coding tables which were used.)
So given a symbol set and frequencies, how can we generate a canonical Huffman code? Unfortunately, there is no straightforward way to do it directly by building a binary tree. But we can use our existing method to generate a non-canonical (but optimal) code, and then rewrite the bitstrings to make them canonical while maintaining their length. Remember, it’s the length of the bitstrings assigned to each symbol which makes a prefix-free code optimal. The exact bitstrings which are used don’t matter; we can shuffle them around and assign different ones with the same length.
The algorithm suggested in the JPEG specification (Appendix K) gets a step ahead of the game by not explicitly building a binary tree with left and right child pointers. It just tracks what the depth of each leaf node would have been, had the nodes actually been built into a binary tree. So that these depths can be incremented whenever two "subtrees" are "joined together", the leaf nodes for each subtree are kept on a linked list, and "subtrees" are "joined" by concatenating their linked lists. (Libjpeg uses this trick when saving a Huffman-encoded JPEG file.[2])
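Here is a sketch of that bookkeeping, reusing the `MinHeap` from earlier and plain Arrays in place of linked lists for brevity: each "subtree" is just a weight plus the list of symbols it contains, and joining two subtrees increments every contained symbol’s code length.

```js
// Sketch: compute code lengths (tree depths) without building tree nodes.
function codeLengths(frequencies) {        // frequencies: Map of symbol -> count
  const length = new Map();                // symbol -> bitstring length
  const heap = new MinHeap();
  for (const [symbol, weight] of frequencies) {
    length.set(symbol, 0);
    heap.push({ weight, symbols: [symbol] });
  }
  while (heap.size > 1) {
    const a = heap.pop();
    const b = heap.pop();
    const joined = a.symbols.concat(b.symbols);
    // "Joining two subtrees" deepens every leaf in both of them by one level.
    for (const s of joined) length.set(s, length.get(s) + 1);
    heap.push({ weight: a.weight + b.weight, symbols: joined });
  }
  return length;
}
```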
Regardless of whether you actually build a binary tree or use the trick from Appendix K, once you know what the lengths of all the bitstrings in an optimal code should be, generating a canonical code is as simple as this:
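A sketch of that step (the `canonicalCode` name, and the symbol → length `Map` input matching the `codeLengths` sketch above, are assumptions): sort the symbols by code length, then hand out consecutive code values, shifting left by one bit each time the length increases.

```js
// Sketch: assign canonical bitstrings given each symbol's code length.
function canonicalCode(lengths) {          // lengths: Map of symbol -> bitstring length
  const symbols = [...lengths.keys()].sort((a, b) => lengths.get(a) - lengths.get(b));
  const code = {};
  let value = 0;
  let prevLength = lengths.get(symbols[0]);
  for (const symbol of symbols) {
    const length = lengths.get(symbol);
    value <<= (length - prevLength);       // lengthen the code value to the new bit length
    code[symbol] = value.toString(2).padStart(length, "0");
    value += 1;                            // next symbol gets the next numeric value
    prevLength = length;
  }
  return code;
}
```

Applied to lengths 1, 2, 3, 3, this hands out exactly the bitstrings from the Wikipedia example above: "0", "10", "110", "111".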
Here is an example. Note that we are not using a dummy, so bitstrings with all 1 bits may be included.
| Random Code | Sorted by Bitstring Length | Canonicalized |
| --- | --- | --- |
Huffman Coding in Practice
All through this article, ASCII characters have been used as Huffman symbols. But in reality, if you want to compress English text, Huffman coding with each character treated as a separate symbol would be a terrible way to do it. Note two big weaknesses with that approach:
- Huffman coding is oblivious to patterns which involve the order of symbols. It only cares about their frequency. But real-life data usually has patterns related to the order of values, which can be exploited to achieve better compression.
- Huffman coding always uses at least one bit for each symbol, and usually much more. So in the "ideal" case of a text file which just contains a single ASCII character repeated thousands of times, Huffman coding with one symbol per letter could only compress it to ⅛ of its original size. 8× compression may sound good, but any reasonable compression method should get far greater gains in that ridiculously easy-to-compress case.
So just what am I saying here? Is Huffman coding a bad algorithm?
Not at all! But it is just one piece of a practical compression method; it’s not a complete compression method by itself. And to make Huffman coding work to greatest advantage, it may be necessary to find an alternative data representation which is well-suited to such coding. Just taking the most "natural" or intuitive representation and directly applying Huffman coding to it will probably not work well.
As an example, in JPEG, the values which we want to compress are quantized DCT coefficients (see the previous post for details), which have 8 bits of precision each.[3] We could take the 256 possible coefficient values as 256 Huffman symbols and Huffman-code them directly, but this would be very suboptimal.
In the symbol set which is actually used, each symbol represents either:
- Some specific number of successive zero coefficients (0-15 of them), and the number of significant bits in the following non-zero coefficient.
- A run of zeroes filling the remainder of a 64-coefficient block.
Note that each symbol only tells us the number of significant bits in the next non-zero coefficient, not what those bits actually are. The actual coefficient value bits are simply inserted into the output data stream uncompressed. This is because the values of non-zero DCT coefficients don’t actually repeat very much, so Huffman-coding them wouldn’t really help. (See the demonstration in the previous post. Does it look like the coefficients within an 8-by-8 DCT matrix repeat much?) However, since the Huffman symbols tell us the number of significant bits, high-order zero bits can be discarded, which does help significantly.
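As a rough illustration only (not a real JPEG encoder; sign handling and the exact bit-level output are glossed over), here is how one block’s 63 zigzag-ordered AC coefficients might be mapped onto that symbol set:

```js
// Sketch: map AC coefficients to (zero-run, significant-bit-count) symbols.
function acSymbols(coefficients) {         // 63 zigzag-ordered AC coefficients
  const out = [];
  let run = 0;
  for (const coeff of coefficients) {
    if (coeff === 0) { run++; continue; }
    while (run > 15) {                     // runs longer than 15 need extra (15, 0) "ZRL" symbols
      out.push({ run: 15, size: 0 });
      run -= 16;
    }
    const size = Math.abs(coeff).toString(2).length; // significant bits in the coefficient
    out.push({ run, size, bits: coeff });  // `bits` stands in for the raw value bits that follow
    run = 0;
  }
  if (run > 0) out.push("EOB");            // zeroes fill the remainder of the block
  return out;
}

console.log(acSymbols([5, -2, 0, 0, 0, 1, ...Array(57).fill(0)]));
// [ { run: 0, size: 3, bits: 5 }, { run: 0, size: 2, bits: -2 },
//   { run: 3, size: 1, bits: 1 }, "EOB" ]
```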
JPEG files can use "arithmetic coding" as an alternative to Huffman coding (although this is not common). I dare say arithmetic coding is a more intriguing and fascinating algorithm than Huffman coding. So it will not surprise you that the next article in this series will focus on arithmetic coding. See you then!
[1] With a canonical code, only the number of bitstrings used of each possible length needs to be stored; how many are 1 bit long, how many 2 bits long, how many 3 bits long, and so on. The actual bitstrings can be quickly recovered from that.
[2] But interestingly, libjpeg does not use a minheap when generating a Huffman code. Instead, it uses an array of symbol frequencies, and scans the whole array at each step to find the two lowest-weighted subtrees.
[3] The JPEG standard actually allows DCT coefficients to be either 8-bit or 12-bit, but 8 bits is almost universally used. Libjpeg can theoretically handle JPEG files with 12-bit coefficients, but it must be specially configured to do so at compile time, and binary distributions are not generally built in that way.