16.3 Huffman codes

16.3-1¶

Explain why, in the proof of Lemma 16.2, if $x.freq = b.freq$, then we must have $a.freq = b.freq = x.freq = y.freq$.

If we have that $x.freq = b.freq$, then we know that $b$ is tied for lowest frequency. In particular, it means that there are at least two things with lowest frequency, so $y.freq = x.freq$. Also, since $x.freq \le a.freq \le b.freq = x.freq$, we must have $a.freq = x.freq$.

16.3-2¶

Prove that a binary tree that is not full cannot correspond to an optimal prefix code.

Let $T$ be a binary tree that is not full. $T$ represents a binary prefix code for a file composed of characters from alphabet $C$, where $c \in C$, $f(c)$ is th number of occurrences of $c$ in the file. The cost of tree $T$, or the number of bits in the encoding, is $\sum_{c \in C} d_T(c) \cdot f(c)$, where $d_T(c)$ is the depth of character $c$ in tree $T$.

Let $N$ be a node of greatest depth that has exactly one child. If $N$ is the root of $T$, $N$ can be removed and the deepth of each node reduced by one, yielding a tree representing the same alphabet with a lower cost. This mean the original code was not optimal.

Otherwise, let $M$ be the parent of $N$, let $T_1$ be the (possibly non-existent) sibling of $N$, and let $T_2$ be the subtree rooted at the child of $N$. Replace $M$ by $N$, making $T_1$ be the children of $N$. If $T_1$ is empty, repeat the process. We have a new prefix code of lower cost, so the original was not optimal.

16.3-3¶

What is an optimal Huffman code for the following set of frequencies, based on
the first $8$ Fibonacci numbers?

$$a:1 \quad b:1 \quad c:2 \quad d:3 \quad e:5 \quad f:8 \quad g:13 \quad h:21$$

Can you generalize your answer to find the optimal code when the frequencies are the first $n$ Fibonacci numbers?

$$ \begin{array}{c|l} a & 1111111 \\ b & 1111110 \\ c & 111110 \\ d & 11110 \\ e & 1110 \\ f & 110 \\ g & 10 \\ h & 0 \end{array} $$

GENERALIZATION

In what follows we use $a_i$ to denote $i$-th Fibonacci number. To avoid any confusiion we stress that we consider Fibonacci's sequence beginning $1$, $1$, i.e. $a_1 = a_2 = 1$.

Let us consider a set of $n$ symbols $\Sigma = \{c_i ~|~ 1 \le i \le n \}$ such that for each $i$ we have $c_i.freq = a_i$. We shall prove that the Huffman code for this set of symbols given by the run of algorithm HUFFMAN from CLRS is the following code:

$code(c_n) = 0$
$code(c_{i - 1}) = 1code(c_i)$ for $2 \le i \le n - 1$ (i.e. we take a code for symbol $c_i$ and add $1$ to the beginning)
$code(c_1) = 1^{n - 1}$

By $code(c)$ we mean the codeword assigned to the symbol $c_i$ by the run of HUFFMAN($\Sigma$) for any $c \in \Sigma$.

First we state two technical claims which can be easily proven using the proper induction. Following good manners of our field we leave the proofs to the reader :-)

(HELPFUL CLAIM 1) $ (\forall k \in \mathbb{N}) ~ \sum\limits_{i = 1}^{k} a_i = a_{k + 2} - 1$
(HELPFUL CLAIM 2) Let $z$ be an inner node of tree $T$ constructed by the algorithm HUFFMAN. Then $z.freq$ is sum of frequencies of all leafs of the subtree of $T$ rooted in $z$.

Consider tree $T_n$ inductively defined by

$T_2.left = c_2$, $T_2.right = c_1$ and $T_2.freq = c_1.freq + c_2.freq = 2$
$(\forall i; 3 \le i \le n) ~ T_i.left = c_i$, $T_i.right = T_{i - 1}$ and $T_i.freq = c_i.freq + T_{i - 1}.freq$

We shall prove that $T_n$ is the tree produced by the run of HUFFMAN($\Sigma$).

KEY CLAIM: $T_{i + 1}$ is exactly the node $z$ constructed in $i$-th run of the for-cycle of HUFFMAN($\Sigma$) and the content of the priority queue $Q$ just after $i$-th run of the for-cycle is exactly $Q = (a_{i + 2}, T_{i + 1}, a_{i + 3}, \dots, a_n)$ with $a_{i + 2}$ being the minimal element for each $1 \le i < n$. (Since we prefer not to overload our formal notation we just note that for $i = n - 1$ we claim that $Q = (T_n)$ and our notation grasp this fact in a sense.)

PROOF OF KEY CLAIM by induction on $i$.

for $i = 1$ we see that the characters with lowest frequencies are exactly $c_1$ and $c_2$, thus obviously the algorithm HUFFMAN($\Sigma$) constructs $T_2$ in the first run of its for-cycle. Also it is obvious that just after this run of the for-cycle we have $Q = (a_3, T_{2}, a_4, \dots, a_n)$.
for $2 \le i < n$ we suppose that our claim is true for all $j < i$ and prove the claim for $i$. Since the claim is true for $i - 1$, we know that just before $i-th$ execution of the for-cycle we have the following content of the priority queue $Q=(a_{i + 1}, T_i, a_{i + 2}, \dots, a_n)$. Thus line 5 of HUFFMAN extracts $a_{i + 1}$ and sets $z.left = a_{i + 1}$ and line 6 of HUFFMAN extracts $T_i$ and sets $z.right = T_i$. Now we can see that indeed $z$ is exactly $T_{i + 1}$. Using (CLAIM 2) and observing the way $T_{i + 1}$ is defined we get that $z.freq = T_{i + 1}.freq = \sum\limits_{i=1}^{i + 1} a_i$. Thus using (CLAIM 1) one can see that $a_{i + 2} < T_{i + 1}.freq < a_{i + 3}$. Therefore for the content of the priority queue $Q$ just after the $i$-th execution of the for-cycle we have $Q=(a_{i + 2}, T_{i + 2}, a_{i + 3}, \dots, a_n)$.

KEY CLAIM tells us that just after the last execution of the for-cycle we have $Q = (T_n)$ and therefore the line 9 of HUFFMAN returns $T_n$ as the result. One can easily see that the code given in the beginning is exactly the code which corresponds to the code-tree $T_n$.

16.3-4¶

Prove that we can also express the total cost of a tree for a code as the sum, over all internal nodes, of the combined frequencies of the two children of the node.

Let tree be a full binary tree with $n$ leaves. Apply induction hypothesis on the number of leaves in $T$. When $n = 2$ (the case $n = 1$ is trivially true), there are two leaves $x$ and $y$ with the same parent $z$, then the cost of $T$ is

$$ \begin{aligned} B(T) & = f(x) d_T(x) + f(y) d_T(y) \\ & = f(x) + f(y) & \text{since $d_T(x) = d_T(y) = 1$} \\ & = f(\text{child}_1\text{ of }z) + f(\text{child}_2\text{ of }z). \end{aligned} $$

Thus, the statement of theorem is true. Now suppose $n > 2$ and also suppose that theorem is true for trees on $n - 1$ leaves. Let $c_1$ and $c_2$ are two sibling leaves in $T$ such that they have the same parent $p$. Letting $T'$ be the tree obtained by deleting $c_1$ and $c_2$, by induction we know that

$$ \begin{aligned} B(T) & = \sum_{\text{leaves } l'\in T'} f(l')d_T(l') \\ & = \sum_{\text{internal nodes } i'\in T'} f(\text{child}_1\text{ of }i') + f(\text{child}_2\text{ of }i'). \end{aligned} $$

Using this information, calculates the cost of $T$.

$$ \begin{aligned} B(T) & = \sum_{\text{leaves }l \in T} f(l)d_T(l) \\ & = \sum_{l \ne c_1, c_2} f(l)d_T(l) + f(c_1)d_T(c_1) - 1 + f(c_2)d_T(c_2) - 1 + f(c_1) + f(c_2) \\ & = \sum_{\text{internal nodes }i'\in T'} f(\text{child}_1\text{ of }i') + f(\text{child}_2\text{ of }i') + f(c_1) + f(c_2) \\ & = \sum_{\text{internal nodes }i\in T} f(\text{child}_1\text{ of }i) + f(\text{child}_1\text{ of }i). \end{aligned} $$

Thus the statement is true.

16.3-5¶

Prove that if we order the characters in an alphabet so that their frequencies are monotonically decreasing, then there exists an optimal code whose codeword lengths are monotonically increasing.

Little formal-mathematical note here: We are required to prove existence of an optimal code with some property. Therefore we are required also to show, that some optimal code exists. It is trivial in this case, since we know that the code produced by a run of Huffman's algorithm produce one such code for us. However, it is good to be aware of this. Proving just the implication "if a code is optimal then it has the desired property" doesn't suffice.

OK, now we are ready to prove the already mentioned implication "if a code is optimal then it has the desired property". Main idea of our proof is that if the code violates desired property, then we find two symbols which violate the property and 'fix the code'. For the formal proof we go as follows.

Suppose that we have an alphabet $C = {a_1, \ldots, a_n}$ where the characters are written in monotonically decreasing order, i.e. $a_1.freq \ge a_2.freq \ge \ldots \ge a_n$. Let us consider an optimal code $B$ for $C$. Let us denote the codeword for the character $c \in C$ in the code $B$ by $cw_B(c)$. W.l.o.g. we can assume that for any $i$ such that $a_i.freq = a_{i + 1}.freq$ it holds that $|cw(a_i)| \le |cw(a_{i + 1})|$. This assumption can be made since for any $a_i.freq = a_{i + 1}.freq$ for which $|cw(a_i)| > |cw(a_{i + 1})|$ we can simply swap codewords for $a_i$ and $a_{i + 1}$ and obtain a code with desired property and the same cost as is the cost of $B$. We prove that $B$ has the desired property, i.e., its codeword lengths are monotonically increasing.

We proceed by contradiction. If lengths of the codewords are not monotonically increasing, then there exist an index $i$ such that $|cw_B(a_i)| > |cw_B(a_{i + 1})| $. Using our assumptions on $C$ and $B$ we get that $a_i.freq > a_{i + 1}.freq$. Define new code $B'$ for $C$ such that for $a_j$ such that $j \ne i$ and $j \ne i + 1$ we keep $cw_{B'}(a_j) = cw_B(a_j)$ and we swap codewords for $a_i$ and $a_{j + 1}$, i.e. we set $cw_{B'}(a_i) = cw_{B}(a_{i + 1})$ and $cw_{B'}(a_{i + 1}) = cw_{B}(a_{i})$. Now compare costs of the codes $B$ and $B'$. It holds that

$$ \begin{aligned} cost(B') &= cost(B) - (|cw_B(a_i)|(a_i.freq) + |cw_B(a_{i + 1})|(a_{i + 1}.freq)) \\ &+ (|cw_B(a_i)|(a_{i + 1}.freq) + |cw_B(a_{i + 1})|(a_{i}.freq)) \\ &= cost(B) + |cw_B(a_i)|(a_{i + 1}.freq - a_i.freq) + |cw_B(a_{i + 1})|(a_i.freq - a_{i + 1}.freq) \end{aligned} $$

For better readability now denote $a_i.freq - a_{i + 1}.freq = \phi$. Since $a_i.freq > a_{i + 1}.freq$, we get $\phi > 0$ and we can write

$$ cost(B') = cost(B) - \phi|cw_B(a_i)| + \phi|cw_B(a_{i + 1})| = cost(B) - \phi(|cw_B(a_i)| - |cw_B(a_{i + 1})|) $$

Since $|cw_B(a_i)| > |cw_B(a_{i + 1})| $, we get $|cw_B(a_i)| - |cw_B(a_{i + 1})| > 0$. Thus $\phi(|cw_B(a_i)| - |cw_B(a_{i + 1})|) > 0$ which imply $cost(B') < cost(B)$. Therefore the code $B$ is not optimal, a contradiction.

Therefore, we conclude that codeword lengths of $B$ are monotonically increasing and the proof is complete.

Note: For those not familiar with mathematical parlance, w.l.o.g means without loss of generality.

16.3-6¶

Suppose we have an optimal prefix code on a set $C = \{0, 1, \ldots, n - 1 \}$ of characters and we wish to transmit this code using as few bits as possible. Show how to represent any optimal prefix code on $C$ using only $2n - 1 + n \lceil \lg n \rceil$ bits. ($\textit{Hint:}$ Use $2n - 1$ bits to specify the structure of the tree, as discovered by a walk of the tree.)

First observe that any full binary tree has exactly $2n - 1$ nodes. We can encode the structure of our full binary tree by performing a preorder traversal of $T$. For each node that we record in the traversal, write a $0$ if it is an internal node and a $1$ if it is a leaf node. Since we know the tree to be full, this uniquely determines its structure.

Next, note that we can encode any character of $C$ in $\lceil \lg n \rceil$ bits. Since there are $n$ characters, we can encode them in order of appearance in our preorder traversal using $n\left\lceil \lg n \right\rceil$ bits.

16.3-7¶

Generalize Huffman's algorithm to ternary codewords (i.e., codewords using the symbols $0$, $1$, and $2$), and prove that it yields optimal ternary codes.

Instead of grouping together the two with lowest frequency into pairs that have the smallest total frequency, we will group together the three with lowest frequency in order to have a final result that is a ternary tree. The analysis of optimality is almost identical to the binary case. We are placing the symbols of lowest frequency lower down in the final tree and so they will have longer codewords than the more frequently occurring symbols.

16.3-8¶

Suppose that a data file contains a sequence of $8$-bit characters such that all $256$ characters are about equally common: the maximum character frequency is less than twice the minimum character frequency. Prove that Huffman coding in this case is no more efficient than using an ordinary $8$-bit fixed-length code.

For any $2$ characters, the sum of their frequencies exceeds the frequency of any other character, so initially Huffman coding makes $128$ small trees with $2$ leaves each. At the next stage, no internal node has a label which is more than twice that of any other, so we are in the same setup as before. Continuing in this fashion, Huffman coding builds a complete binary tree of height $\lg 256 = 8$, which is no more efficient than ordinary $8$-bit length codes.

16.3-9¶

Show that no compression scheme can expect to compress a file of randomly chosen $8$-bit characters by even a single bit. ($\textit{Hint:}$ Compare the number of possible files with the number of possible encoded files.)

If every possible character is equally likely, then, when constructing the Huffman code, we will end up with a complete binary tree of depth $7$. This means that every character, regardless of what it is will be represented using $7$ bits.

This is exactly as many bits as was originally used to represent those characters, so the total length of the file will not decrease at all.