Search code examples
algorithmcompressionhuffman-code

Generating a Huffman code that never produces the string "00" upon encoding


I am trying to use Huffman coding to create an optimal coding for a set of symbols. However, a constraint is placed on the encodings such that no encoding contains the string, "00".
For example, the encoding 'A' = '0' and 'B' = '10' would not satisfy the constraint because the string 'BA' encodes to '100', which contains the "00" substring.
This means that the code words also cannot contain the string "00". For example, the encoding 'A' = '1', B = '00', and C = '01' would not satisfy the constraint because encoding 'B' would always result in '00' appearing in the encoding.

I have tried modifying the Huffman coding algorithm found on Wikipedia:

  1. Create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue:
    1. Remove the two nodes of highest priority (lowest probability) from the queue
      • If both nodes are not leaf nodes, select the highest priority node and the highest priority leaf node. This ensures that at least one of the selected nodes is a leaf node.
    2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
      • If one node is not a leaf node make that node the right child of the new internal node (making it a '1' when encoding). This avoids creating the "00" substring.
    3. Add the new node to the queue.
  3. The remaining node is the root node and the tree is complete.
  4. Add a '1' to the beginning of all codes to avoid the "00" substring when two adjacent symbols are encoded.

There is also the case where the only two nodes left in the queue are both non-leaf nodes. I am not sure how to solve this problem. Otherwise, I believe this creates a coding that satisfies the constraint, but I am unsure if it is optimal, and I would like to be certain that it is.


Solution

  • The easiest way is to not mess with the Huffman code at all. Instead, post-process the output.

    We will start with simple bit stuffing. When encoding, take your coded bit stream and whenever there is a 0, insert a 1 after it. On the decoding end, whenever you see a 0, remove the next bit (which will be the 1 that was inserted). Then do the normal Huffman decoding.

    This is not optimal, but the departure from optimality is bounded. You can reduce the impact of the bit stuffing by swapping the branches at every node, as needed, to put the lower probabilities or weights on the 0 sides.

    This simple bit stuffing expands the input, on average, by a factor of 1.5.

    So how close is this simple bit stuffing to optimal? It turns out the the number of possible n-bit patterns that have no occurrences of two 0 bits in a row is Fn+2, where Fn is the nth Fibonacci number. With such a sequence we can code at most log2(Fn+2) bits. The optimal expansion ratio of n bits is then n / log2(Fn+2). In the limit of large n, that is 1 / log2(𝜑), where 𝜑 is the Golden Ratio. That optimal expansion ratio is 1.44042.

    So the 1.5 from simple bit stuffing actually isn't too shabby. It's within 4% of optimal.

    But we can do better.

    We can use the Fibonacci sequence, which represents the number of possible coded values for each bit added to the sequence without repeating 0s, to code the input bits. We show such an approach below, though for convenience, we avoid any repeating 1s instead of any repeating 0s. (Just invert the output to avoid repeating 0s.) Here is example C code that does that encoding and decoding for a fixed-size input and output:

    typedef unsigned __int128 u128_t;
    
    // Encode 87 bits to a 125-bit sequence that does not have two 1 bits in a row
    // anywhere. Note that if such sequences are concatenated and the high bit of
    // the 125 is a 1, then a 0 bit needs to be appended to make it 126 bits. This
    // will occur 38.2% of the time (1 / goldenratio^2). The final expansion ratio
    // of this encoding is then 125.382 / 87 = 1.44117. The theoretical optimum
    // ratio is 1 / lg(goldenratio) = 1.44042. This encoding gets within 0.05% of
    // optimal.
    u128_t encode87to125(u128_t n) {
        n &= ((u128_t)1 << 87) - 1;
        u128_t e = 0;
    
        // Fibonacci numbers 126 and 125. (gcc and clang do not support 128-bit
        // literals, so these are assembled, which will occur at compile time.)
        u128_t fn = ((u128_t)0x4f88f2 << 64) | 0x877183a413128aa8u,
               fnm1 = ((u128_t)0x3127c1 << 64) | 0xed0f4dff88ba1575u;
        for (;;) {
            // Encode one bit.
            e <<= 1;
            if (n >= fn) {
                e |= 1;
                n -= fn;
            }
    
            if (fn == 1)
                // Done when the start of the Fibonacci sequence (1, 1) reached.
                break;
    
            // Compute the Fibonacci number that precedes fnm1, and move fn and
            // fnm1 both down one in the sequence.
            u128_t fnm2 = fn - fnm1;
            fn = fnm1;
            fnm1 = fnm2;
        }
        return e;
    }
    
    // Decode a 125-bit value encoded by encode87to125() back to the orignal 87-bit
    // value.
    u128_t decode125to87(u128_t e) {
        // Start at the beginning of the Fibonacci sequence (1, 1).
        u128_t d = 0, fn = 1, fnm1 = 1;
        for (;;) {
            // Decode the low bit of e.
            if (e & 1)
                d += fn;
            e >>= 1;
    
            if (e == 0)
                // Done when there are no more 1 bits in e, since nothing more will
                // be added to d.
                break;
    
            // Advance fnm1 and fn up one spot in the Fibonacci sequence.
            u128_t fnp1 = fn + fnm1;
            fnm1 = fn;
            fn = fnp1;
        }
        return d;
    }
    

    The input is then encoded 87 bits at a time, and the output is 125 or 126 bits for each input block, the latter if the 125 bits has a 1 bit in the top position, in which case a 0 needs to be stuffed.

    The values 87 and 125 were picked since they are the most efficient set that permits the output to fit in 128 bits. This gives an expansion ratio of 1.44117, within 0.05% of optimal. Many other choices are possible. The output is encoded a bit at a time, and decoded a bit at a time, so there is no need to accumulate it in an integer. Then we could go to 112 bits encoded in 161 bits. Or we could limit ourselves to 64-bit arithmetic and convert 62 bits to 89 bits (within 0.09% of optimal).

    At the end of the bit sequence, the remaining bits can be extended to 87 bits with high zeros, and the encoded bits will then have high zeros that don't need to be sent. When decoding, fill out the last bits of the 125 with high 0s. If you don't know how many bits to expect when decoding, then append a single high 1 bit to the input before encoding, followed by high 0 bits. Then when decoding, scan back from the end through the 0 bits to get to the first 1 bit, and then discard all of those.

    This post-processing approach to meet the constraints on the bit stream is arguably preferable to any attempt at modifying the Huffman code to be optimal for different-weight letters. Not only does it permit fast and well-tested Huffman algorithm implementations to be used as is, but also this will work with any entropy coding scheme, be it Huffman, arithmetic, or asymmetric numeral systems.