Search code examples
c++zlibdeflate

How to find the end of a DEFLATE block


For self education purposes I am trying to create a program that will convert a png file into an array of RGBA values. However I am having a problem with decoding the IDAT sections which are encoded with zlib using the deflate format. The problem I am having is I do not know how to find where the end of a compressed block is. The only place in the documentation where a length is shown is just for decompressed blocks however for block with the default huffman tables and the huffman tables supplied there does not seem to be a way to find out where the block ends. How should i go about finding the end of a block in the deflate format.


Solution

  • Per the PNG Deflate/Inflate Compression documentation

    The compressed data within the zlib datastream is stored as a series of blocks, each of which can represent raw (uncompressed) data, LZ77-compressed data encoded with fixed Huffman codes, or LZ77-compressed data encoded with custom Huffman codes. A marker bit in the final block identifies it as the last block, allowing the decoder to recognize the end of the compressed datastream. Further details on the compression algorithm and the encoding are given in the deflate specification [RFC-1951].

    ...

    In a PNG file, the concatenation of the contents of all the IDAT chunks makes up a zlib datastream as specified above. This datastream decompresses to filtered image data as described elsewhere in this document.

    It is important to emphasize that the boundaries between IDAT chunks are arbitrary and can fall anywhere in the zlib datastream. There is not necessarily any correlation between IDAT chunk boundaries and deflate block boundaries or any other feature of the zlib data. For example, it is entirely possible for the terminating zlib check value to be split across IDAT chunks.*"

    And per RFC 1951: "DEFLATE Compressed Data Format Specification version 1.3":

    A compressed data set consists of a series of blocks, corresponding to successive blocks of input data. The block sizes are arbitrary, except that non-compressible blocks are limited to 65,535 bytes.

    Each block is compressed using a combination of the LZ77 algorithm and Huffman coding. The Huffman trees for each block are independent of those for previous or subsequent blocks; the LZ77 algorithm may use a reference to a duplicated string occurring in a previous block, up to 32K input bytes before.

    Each block consists of two parts: a pair of Huffman code trees that describe the representation of the compressed data part, and a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.) The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. The representation used in the "deflate" format limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of a block, except for uncompressible blocks, which are limited as noted above.

    Each type of value (literals, distances, and lengths) in the compressed data is represented using a Huffman code, using one code tree for literals and lengths and a separate code tree for distances. The code trees for each block appear in a compact form just before the compressed data for that block.

    So, to determine the end of a given block, you would have to parse the block's Huffman codes to know the location and type of each element within the compressed data, and then you can process each element as needed until you find the end of the last element in the block. Section 3.2 goes into the technical details of how the compressed blocks are formatted, particularly Section 3.2.3:

    3.2.3. Details of block format

    Each block of compressed data begins with 3 header bits containing the following data:

    first bit       BFINAL
    next 2 bits     BTYPE
    

    Note that the header bits do not necessarily begin on a byte boundary, since a block does not necessarily occupy an integral number of bytes.

    BFINAL is set if and only if this is the last block of the data set.

    BTYPE specifies how the data are compressed, as follows:

    00 - no compression
    01 - compressed with fixed Huffman codes
    10 - compressed with dynamic Huffman codes
    11 - reserved (error)
    

    The only difference between the two compressed cases is how the Huffman codes for the literal/length and distance alphabets are defined.

    In all cases, the decoding algorithm for the actual data is as follows:

    do
        read block header from input stream.
        if stored with no compression
            skip any remaining bits in current partially
                processed byte
            read LEN and NLEN (see next section)
            copy LEN bytes of data to output
        otherwise
            if compressed with dynamic Huffman codes
                read representation of code trees (see
                    subsection below)
            loop (until end of block code recognized)
                decode literal/length value from input stream
                if value < 256
                    copy value (literal byte) to output stream
                otherwise
                    if value = end of block (256)
                        break from loop
                    otherwise (value = 257..285)
                        decode distance from input stream
    
                        move backwards distance bytes in the output
                        stream, and copy length bytes from this
                        position to the output stream.
            end loop
    while not last block
    

    Note that a duplicated string reference may refer to a string in a previous block; i.e., the backward distance may cross one or more block boundaries. However a distance cannot refer past the beginning of the output stream. (An application using a preset dictionary might discard part of the output stream; a distance can refer to that part of the output stream anyway) Note also that the referenced string may overlap the current position; for example, if the last 2 bytes decoded have values X and Y, a string reference with <length = 5, distance = 2> adds X,Y,X,Y,X to the output stream.