If I have a set of symbols and frequencies:

A - 0.1
B - 0.40
C - 0.2
D - 0.23
E - 0.15
F - 0.17
The Huffman algorithm produces only codewords of length greater than 1. But when I change a frequency to be greater than 0.40, it produces a codeword of length 1 along with longer ones. How can I construct a proof that this is the case for any set of symbols, not just this one?
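For concreteness, this is the kind of construction I mean (a minimal sketch using Python's heapq; the function is only illustrative, not my exact code, and the 0.80 in the last line is just an arbitrary "dominant" value):

    import heapq
    from itertools import count

    def huffman_code_lengths(freqs):
        """Return {symbol: codeword length} for a Huffman code built from freqs."""
        tiebreak = count()  # keeps heap entries comparable when weights are equal
        # Each heap entry is (weight, tiebreaker, {symbol: depth within this subtree}).
        heap = [(w, next(tiebreak), {s: 0}) for s, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Merge the two lowest-weight subtrees; every symbol inside them
            # moves one level further from the root.
            w1, _, d1 = heapq.heappop(heap)
            w2, _, d2 = heapq.heappop(heap)
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
        return heap[0][2]

    freqs = {"A": 0.1, "B": 0.40, "C": 0.2, "D": 0.23, "E": 0.15, "F": 0.17}
    print(huffman_code_lengths(freqs))                 # every codeword is at least 2 bits
    print(huffman_code_lengths({**freqs, "B": 0.80}))  # now B gets a 1-bit codeword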
(Note that your frequencies don't add up to 1; I'll assume it's a typo.)
Here is a sketch of a proof that, in order for all codewords to be longer than 1 bit, no frequency can be greater than 2/5. Without loss of generality, the Huffman tree must look like this:
     a+b+c+d   (the sum must be equal to 1)
      /     \
    a+b     c+d
    / \     / \
   a   b   c   d
We must prove that all of a, b, c, and d are no greater than 2/5.
WLOG (again), a = b <= c <= d, so the tree becomes:
     2a+c+d
      /   \
    2a    c+d
    / \   / \
   a   a c   d
Let's find the maximal value of d that is consistent with this Huffman tree. According to how the algorithm works, the following inequalities hold:

    0 <= a,   a <= c,   c <= d,   d <= 2a

(the last one because otherwise the algorithm would have merged 2a with c rather than c with d). Let's also replace c by 1-d-2a, since the frequencies sum to 1:

    0 <= a,   a <= 1-d-2a,   1-d-2a <= d,   d <= 2a
It's not immediately obvious how this constrains a and d, but you can easily plot the constraints in the a/d coordinate space. Then, you know which two of the above four inequalities are most important:
    d/2 <= a <= (1-d)/3
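To spell that step out, here is each nontrivial constraint rewritten as a bound on a (just algebra on the inequalities already listed, nothing new):

    \begin{align*}
      a \le 1 - d - 2a &\iff a \le \tfrac{1-d}{3}, \\
      1 - d - 2a \le d &\iff a \ge \tfrac{1-2d}{2}, \\
      d \le 2a         &\iff a \ge \tfrac{d}{2}.
    \end{align*}

The first and third of these are the ones that matter for bounding d; together they give the chain d/2 <= a <= (1-d)/3.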
From here:

    d/2 <= (1-d)/3

Multiplying both sides by 6 gives 3d <= 2 - 2d, so d <= 2/5.
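As a quick sanity check of this direction of the statement (if every codeword is longer than 1 bit, then the largest frequency is at most 2/5), here is a small randomized experiment. It reuses the same kind of heapq-based construction sketched in the question, the helper name is mine, and it is only a spot check, not a proof:

    import heapq
    import random
    from itertools import count

    def huffman_code_lengths(freqs):
        """Return {symbol: codeword length} for a Huffman code built from freqs."""
        tiebreak = count()
        heap = [(w, next(tiebreak), {s: 0}) for s, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, d1 = heapq.heappop(heap)
            w2, _, d2 = heapq.heappop(heap)
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
        return heap[0][2]

    random.seed(0)
    for _ in range(10_000):
        n = random.randint(2, 8)
        weights = [random.random() for _ in range(n)]
        total = sum(weights)
        freqs = {i: w / total for i, w in enumerate(weights)}
        lengths = huffman_code_lengths(freqs)
        if min(lengths.values()) > 1:
            # Every codeword is at least 2 bits, so no frequency should exceed 2/5.
            assert max(freqs.values()) <= 2 / 5 + 1e-12
    print("no counterexample found")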