Search code examples
c++bytebithuffman-codedeflate

Am I packing bytes correctly for DEFLATE compression?


So, I know there are lots of libraries available to do DEFLATE compression. If I were working on a production product I would use something like zlib. But as a hobby, I'm implementing it myself to try and figure it out. So after a couple of weeks of coding, re-coding, and tweaking, I'm finally able to produce some output in a reasonable time frame that I think is decent. If I try to post my output into one of the online tools however, I get errors that don't necessarily help me in figuring out the problem with my output. When I have my program generate an actual string of bits and I parse it out by hand, everything seems to conform to DEFLATE standard and I can rebuild my data. This leads me to believe that my encoding is correct but I may completely be misunderstanding the differing bit orders when packing the bytes. The following are a Base64 encoded version of my output and then the list of 8bit bytes generated by my program. If anyone can assist in pointing me to where the data is failing it would be much appreciated.

Defective program output (both Base64 and raw bytes):

Base64 Encoded Output:
ZYQhAQAADMKqQBWagELQXz/AzTQX+eAB

Byte List:
01100101
10000100
00100001
00000001
00000000
00000000
00001100
11000010
10101010
01000000
00010101
10011010
10000000
01000010
11010000
01011111
00111111
11000000
11001101
00110100
00010111
11111001
11100000
00000001

As an overview of my understanding of the documentation. The standard says a block starts with 1 bit to declare whether its the last block, then 2bits to declare what type of compression is used then 5bits hlit, 5bits hdist, 4bits hclen, then hclen+4 sets of 3bits each that give the code lengths for the huffman code used to output the code lengths of the the literal/length codes as well as the distance codes. After this comes the huffman encoded string of hlit+257+hdist+1 code lengths and then finally the huffman encoded string of actual compressed data capped off by the end of block code. The interesting part is that huffman codes themselves are packed in reverse order... Where I get confused however concerns the the "extra bits" that come after some of the length codes (codes 16, 17, 18) as well as after the higher length and distance codes. Do those get packed in the same reverse order as the huffman codes or are they treated as "data other than huffman code"?

Looking at first byte in list (byte 0):

*01 = last block bit
*02 = 2bit compression type (10 = dynamic huffman)
*03 = msb of hlit (#of literal/length codes - 257)

 *03                 *02     *01
  v                   v       v
+-------------------------------+
| 0   1   1   0   0   1   0   1 |
+-------------------------------+
| Byte 0                        |



Looking at bytes 8 and 9 (starting with byte 0):

*01 = last bit of hclen + 4 sets of codelen code lengths
*02 = msb of huffman code "10" ("10" = codelen code 18 - repeat 0 11-138 times)
*03 = lsb of 7 "extra bits" for codelen code 18
*04 = msb of 7 "extra bits" for codelen code 18

                  *03     *02 *01                           *04
                   v       v   v                             v
+---------------------------------+---------------------------------+
|  1   0   1   0   1   0   1   0  |  0   1   0   0   0   0   0   0  |
+---------------------------------+---------------------------------+
| Byte 8                          | Byte 9                          |

Here's some additional output from my program with the actual huffman codes used:

--------------------------------------------------------------------------
Literal/Length Bit Codes:  Block: 0    hlit: (269 - 257) = 12
--------------------------------------------------------------------------
Code: 32        Count: 1        BitCode: 000                Bit Length: 3                   
Code: 33        Count: 1        BitCode: 001                Bit Length: 3                   
Code: 66        Count: 1        BitCode: 010                Bit Length: 3                   
Code: 97        Count: 1        BitCode: 011                Bit Length: 3                   
Code: 98        Count: 1        BitCode: 100                Bit Length: 3                   
Code: 104       Count: 1        BitCode: 101                Bit Length: 3                   
Code: 108       Count: 1        BitCode: 110                Bit Length: 3                   
Code: 256       Count: 1        BitCode: 1110               Bit Length: 4                   
Code: 268       Count: 1        BitCode: 1111               Bit Length: 4                   

--------------------------------------------------------------------------
Distance Bit Codes:  Block: 0    hdist: (5 - 1) = 4
--------------------------------------------------------------------------
Code: 4         Count: 1        BitCode: 00                 Bit Length: 2                   

--------------------------------------------------------------------------
CodeLength Bit Codes:  Block: 0    hclen: (16 - 4) = 12
--------------------------------------------------------------------------
Code: 2         Count: 1        BitCode: 110                Bit Length: 3                   
Code: 3         Count: 7        BitCode: 00                 Bit Length: 2                   
Code: 4         Count: 2        BitCode: 111                Bit Length: 3                   
Code: 17        Count: 4        BitCode: 01                 Bit Length: 2                   
Code: 18        Count: 5        BitCode: 10                 Bit Length: 2                   

--------------------------------------------------------------------------
--------------------------------------------------------------------------

Solution

  • The extra bits are not reversed.

    Your problem is that a single distance code of length 2 is not permitted. A single distance code must have length 1. From RFC 1951:

    If only one distance code is used, it is encoded using one bit, not zero bits; in this case there is a single code length of one, with one unused code.