Tags: huffman-code, huffman-tree

Huffman coding rules and optimization


All the sources I've read describe the following process for building Huffman codes:

  • Arrange the elements in ascending order of their frequencies.
  • Create a binary tree by repeatedly combining the two least frequent elements into a single node, with the sum of their frequencies as the frequency of the new node.
  • Assign '0' to the left branch and '1' to the right branch at each step.
  • The Huffman code for each element is obtained by traversing from the root of the tree to the respective leaf node, recording the path taken (0 for left, 1 for right).
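
For reference, here is a quick sketch of that process in Python; the heapq priority queue, the tie-breaking counter and the "0 for left, 1 for right" assignment are my own choices rather than anything mandated by the algorithm:

    import heapq

    def huffman_codes(freqs):
        """Build Huffman codes from a {symbol: frequency} map."""
        # Heap entries are (frequency, tie-breaker, tree); a tree is either a
        # symbol (leaf) or a (left, right) pair (internal node). The heap takes
        # care of the "ascending order of frequencies" step.
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # least frequent
            f2, _, right = heapq.heappop(heap)   # second least frequent
            heapq.heappush(heap, (f1 + f2, counter, (left, right)))
            counter += 1
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node
                walk(node[0], prefix + "0")      # 0 for the left branch
                walk(node[1], prefix + "1")      # 1 for the right branch
            else:                                # leaf
                codes[node] = prefix or "0"      # single-symbol edge case
        _, _, root = heap[0]
        walk(root, "")
        return codes

    print(huffman_codes({"A": 1, "B": 2, "C": 3, "D": 4}))
    # -> {'D': '0', 'C': '10', 'A': '110', 'B': '111'}
    # The exact bit patterns depend on how ties are broken, but the code
    # lengths (1, 2, 3, 3) are the same as in the hand-built tree below.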

Let's try this for the following Frequency Map:

A -> 1 B -> 2 C -> 3 D -> 4

Making the Tree

        (ABCD, 10)
       /          \
   (ABC, 6)         D:4
   /     \
 (AB, 3)  C:3
 /    \
A:1   B:2

Following this tree to get the codes:

A -> 000 B -> 001 C -> 01 D -> 1

Obviously, the max depth and the max code length do not follow what I'd expect to be the logical rule of log2(nElements), which in this case should be two.

I would have expected the optimal codes to be:

A -> 00 B -> 01 C -> 10 D -> 11

So what's going on here? What am I doing wrong?


Solution

  • Imagine encoding the string "ABBCCCDDDD" - a sample string in which each symbol occurs with its given frequency.

    The Huffman encoding you came up with encodes it as 0000010010101011111 - 19 bits.

    The encoding you optimized for max code length gives 00010110101011111111 - 20 bits.

    Your mistake is thinking that max depth is what is being optimized (or what needs to be optimized); Huffman coding optimizes for the total length of the encoded text.

    And as a side note, as I mentioned in a comment, Huffman produces "an" optimal solution, not "the" optimal solution: I think it's possible for a lower-depth solution to be just as good, but it won't be better.
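
    As a quick way to check those two totals, here is a small Python snippet; the two code tables are simply the ones from the question, and the encode helper is just for illustration:

        # Encode the sample string with both code tables and compare the totals.
        huffman = {"A": "000", "B": "001", "C": "01", "D": "1"}
        fixed   = {"A": "00",  "B": "01",  "C": "10", "D": "11"}

        text = "ABBCCCDDDD"

        def encode(text, table):
            return "".join(table[ch] for ch in text)

        for name, table in [("Huffman", huffman), ("fixed-length", fixed)]:
            bits = encode(text, table)
            print(f"{name}: {bits} ({len(bits)} bits)")

        # Huffman: 0000010010101011111 (19 bits)
        # fixed-length: 00010110101011111111 (20 bits)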