algorithm zip compression huffman-code deflate

Memory usage by DEFLATE algorithm

Many implementations of the DEFLATE algorithm allow for stream-based inputs. An example is Java's java.util.zip.ZipOutputStream. As far as I'm aware, DEFLATE builds a Huffman tree prior to compression and writes this tree at the BEGINNING of a .zip file.

However, in order to build a Huffman tree, the program needs counts of the most frequently occurring characters, which it has only after reading the file. If the stream is so large that it cannot fit in memory at once, how is the Huffman tree built?

Solution

DEFLATE compresses data in blocks. Each block includes its own tree/counts, and is usually about 16KB uncompressed.

The format is specified in the RFC 1951