According to the DEFLATE specification (RFC 1951), the literal and length alphabets are combined so that they can be decoded with a single Huffman tree. There are 256 possible literal values and likewise 256 possible match lengths, yet the combined literal/length alphabet is only 286 symbols long, one of those symbols being an end-of-block symbol.
Only 29 of the 256 possible lengths have their own symbols in the combined alphabet; extra bits follow the length symbol in the compressed data and select the exact length within that symbol's range. These extra bits are not Huffman coded; they are read as plain integers.
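For concreteness, here is a minimal sketch of that decode step in C. The base-length and extra-bit tables are the ones from section 3.2.5 of RFC 1951; `handle_symbol` is a hypothetical helper, and the extra bits are passed in as an integer rather than read from a real bit stream, so the example stands alone:

```c
#include <stdio.h>

/* Base match lengths and extra-bit counts for length codes 257..285,
   from RFC 1951, section 3.2.5. */
static const int length_base[29] = {
      3,   4,   5,   6,   7,   8,   9,  10,  11,  13,
     15,  17,  19,  23,  27,  31,  35,  43,  51,  59,
     67,  83,  99, 115, 131, 163, 195, 227, 258
};
static const int length_extra_bits[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
    1, 1, 2, 2, 2, 2, 3, 3, 3, 3,
    4, 4, 4, 4, 5, 5, 5, 5, 0
};

/* Classify a symbol from the combined literal/length alphabet and, for
   a length code, recover the full match length from the base value plus
   the uncompressed extra bits. */
static void handle_symbol(int symbol, int extra)
{
    if (symbol < 256) {
        printf("symbol %3d: literal byte 0x%02x\n", symbol, symbol);
    } else if (symbol == 256) {
        printf("symbol 256: end of block\n");
    } else {
        int idx = symbol - 257;
        printf("symbol %3d: match length %d (base %d, %d extra bits)\n",
               symbol, length_base[idx] + extra,
               length_base[idx], length_extra_bits[idx]);
    }
}

int main(void)
{
    handle_symbol( 65, 0);  /* literal 'A' */
    handle_symbol(256, 0);  /* end-of-block marker */
    handle_symbol(273, 5);  /* base 35 with 3 extra bits: lengths 35..42 */
    return 0;
}
```

Note that symbol 285 stands alone for the maximum length, 258, with no extra bits, while its neighbors spend five extra bits each to cover ranges of 32 lengths.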
Why not include all 256 length values in the combined alphabet, giving a literal/length alphabet of size 512 (513 including the end-of-block symbol)? Wouldn't that compress the lengths better?
Yes, that would compress the lengths better. But not by much. I tried it on a few large files, and I saw about a 0.25% reduction in the compressed file sizes.
I can't speak for Phil Katz on the "why" question. (Phil is long dead.) I can only guess that he decided to apply the same approach used for the distance codes to the length codes, to reduce the number of symbols that had to be Huffman coded. He had to do that for the distance codes in order to get any compression with a code length limit of 15 bits (which was important on the 16-bit processors of the time). He probably wanted to limit the number of symbols in the literal/length code as well, to reduce both the time spent Huffman coding them and the space needed for the encoding and decoding tables.