Search code examples
algorithmproofhuffman-code

proving that huffman's algorithm can produce a codeword of length 1 when frequency greater than 0.40


If I have a set of symbols and frequencies: A - 0.1 B - 0.40 C - 0.2 D - 0.23 E - 0.15 F - 0.17

The Huffman algorithm will produce codewords that are only greater than length 1.

But when I change a frequency to be greater than 0.40, it will produce a codeword of length 1 and greater. How can construct a proof that proves that this is the case for any set of symbols, not just this one?


Solution

  • (Note that your frequencies don't add to 1; I'll assume it's a typo)

    Here is a sketch of a proof that to make all codewords greater than 1 bit, no frequency can be greater than 2/5. Without loss of generality, the huffman tree must look like this:

        a+b+c+d (the sum must be equal to 1)
         /   \
      a+b     c+d
      / \     / \
     a   b   c   d
    

    We must prove that all of a, b, c, and d are no greater than 2/5.

    WLOG (again) a = b <= c <= d.

        2a+c+d
         /   \
       2a     c+d
      / \     / \
     a   a   c   d
    

    Let's find the maximal value of d that is consistent with this Huffman tree. According to how the algorithm works, the following inequalities hold:

    • a <= c
    • a <= d
    • 2a >= c
    • 2a >= d

    Let's also replace c by 1-d-2a:

    • a <= (1-d)/3
    • a <= d
    • a >= (1-d)/4
    • a >= d/2

    It's not immediately obvious how this constrains a and d, but you can easily plot the constraints in the a/d coordinate space. Then, you know which two of the above four inequalities are most important:

    d/2 <= a <= (1-d)/3

    From here:

    d/2 <= (1-d)/3

    So d <= 2/5.