File under: "Unexpected Efficiency Dept."
The first 90 million numbers take up about 761MB, as output by:
seq 90000000
According to man parallel
, it can speed up gzip
's archiving big files by chopping the input up, and using different CPUs to compress the chunks. So even though gzip
is single-threaded this technique makes it multi-threaded:
seq 90000000 | parallel --pipe --recend '' -k gzip -9 >bigfile.gz
Took 46 seconds, on an Intel Core i3-2330M (4) @ 2.2GHz.
Pipe that to plain old gzip
:
seq 90000000 | gzip -9 > bigfile2.gz
Took 80 seconds, on the same CPU. Now the surprise:
ls -log bigfile*.gz
Output:
-rw-rw-r-- 1 200016306 Jul 3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul 3 17:30 bigfile2.gz
300K larger? That didn't look right. First I checked with zdiff
if the files had the same contents -- yes, the same. I'd have supposed any compressor would do better with a continuous data stream than a chunked one. Why isn't bigfile2.gz
smaller than bigfile.gz
?
The reason is that for this particular, rather unusual input, smaller deflate blocks are better than larger ones. By default gzip
uses larger deflate blocks, since that works best for normal input data. The parallel
command is forcing a few smaller deflate blocks by breaking up the input every 1 MB, resulting in a small gain. Though most of the blocks are still the same size.
You can do much better by setting a smaller block size for every block by using zlib's memLevel
parameter in deflateInit2()
. Here I compress the same output in a single thread each time, using memLevel
values from 9 to 2, where a smaller memLevel
is a smaller deflate block size (note that zlib does a little better than your gzip
at the default level):
The optimum memLevel
for this data turns out to be 4, for which the compressed data is 12 MB (9%) smaller than for the default memLevel
of 8. For memLevel
8, the deflate block size is 16383 symbols, whereas for memLevel
4, the deflate block size is 1023 symbols. One symbol is either a literal byte or a match.
The improvement comes from the extremely regular nature of the input, resulting in a regular sequence of match and literal commands. The smaller the block size, the fewer such distinct commands that appear, which then takes fewer bits to code each of them. This is still true for memLevel
3, but by then the overhead of the code description at the beginning of each deflate block cancels the improvement from fewer distinct codes.
zopfli
is a deflate compressor that optimizes the block size and the commands selected, and managed to compress it to 100,656,812 bytes. It took three and a half hours though! zopfli
is invoked with pigz
using compression level 11.