zlib decompression invalid distances set


./minigzip: my_file_name.gz: invalid distances set

I downloaded the source from https://github.com/madler/zlib and tested several zlib versions with the following command, where vxxx stands for the tag under test:

git reset --hard && git clean -df && git checkout vxxx && ./configure && make && make test && ./minigzip -d my_file_name.gz

It turns out that v1.2.3.4 through v1.2.7 fail with this "invalid distances set" error, while v1.2.3.3 and earlier, as well as v1.2.7.1 and later, decompress the file without problems.
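
To check whether a particular zlib build falls inside the range that failed here, the comparison can be sketched as below. This is only an illustration: the bounds are the ones observed above for this one file, the helper names `parse_version` and `in_failing_range` are made up for the example, and the last line reports the zlib that Python itself loads, which is not necessarily the native library Hadoop uses.

```python
import re
import zlib

# Inclusive bounds observed above for this particular file (an assumption
# taken from the test results, not an official list of affected releases).
FIRST_FAILING = (1, 2, 3, 4)
LAST_FAILING = (1, 2, 7)

def parse_version(text):
    """Turn 'v1.2.7.1' or '1.2.11' into a tuple of ints, ignoring any suffix."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))

def in_failing_range(text):
    """True if the version lies between the first and last failing tags."""
    return FIRST_FAILING <= parse_version(text) <= LAST_FAILING

# Version of the zlib library this Python process is linked against at run time.
print(zlib.ZLIB_RUNTIME_VERSION, in_failing_range(zlib.ZLIB_RUNTIME_VERSION))
```

Tuple comparison handles the four-part tags correctly: `(1, 2, 7)` sorts before `(1, 2, 7, 1)`, so v1.2.7.1 lands outside the range.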

This problem comes from our Hadoop cluster. One job writes gzipped files and another job consumes them. The consuming job reads thousands of files, and occasionally one of them triggers this error. We are using v1.2.7 of the zlib native library.

However, the gzip (gunzip) command-line utility decompresses this file normally, which is why I tested the different zlib versions above.

Is this just corrupted data or should I upgrade the zlib version?
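
One way to tell the two cases apart is to stream the file through a zlib build that is known to work and let it verify the CRC-32 and length stored in the gzip trailer. The sketch below uses Python's `zlib` module for this; the function name `check_gzip` and its return shape are illustrative, and the result only reflects the zlib version that Python is linked against.

```python
import zlib

def check_gzip(path, chunk_size=64 * 1024):
    """Decompress a gzip file in chunks and report (ok, detail, output_bytes).

    wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer, so the
    CRC-32 and length checks are performed by zlib itself.
    """
    total = 0
    members = 0
    decomp = None  # None means we are between gzip members
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                while chunk:
                    if decomp is None:
                        decomp = zlib.decompressobj(wbits=31)
                    total += len(decomp.decompress(chunk))
                    if decomp.eof:
                        # Member finished; leftover bytes start the next one.
                        members += 1
                        chunk = decomp.unused_data
                        decomp = None
                    else:
                        chunk = b""
    except zlib.error as exc:
        return False, str(exc), total
    if decomp is not None:
        return False, "truncated stream", total
    if members == 0:
        return False, "empty file", total
    return True, "ok", total
```

Calling `check_gzip("my_file_name.gz")` on a build outside the failing range and getting `(True, "ok", ...)` means the stored checksum matches the decompressed data, which points away from corruption.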


Solution

  • Adler confirmed that there was indeed a bug in the decompressor side of zlib, and that it was fixed in v1.2.7.1.

    To reproduce the problem, I decompressed the .gz file with a recent version of zlib (v1.2.11), then recompressed the result with v1.2.7, and got a compressed file of exactly the same size:

    ~/workdir$ ll
    -rw------- 1 xxx xxx 232917709 Mar 15 13:54 my_file_name.gz
    
    ~/zlib/$ git reset --hard && git clean -df && git checkout v1.2.11 && ./configure && make && ./minigzip -d ../workdir/my_file_name.gz
    
    ~/workdir$ ll
    -rw-rw-r-- 1 xxx xxx 705650679 Mar 15 13:59 my_file_name
    
    ~/zlib/$ git reset --hard && git clean -df && git checkout v1.2.7 && ./configure && make && ./minigzip ../workdir/my_file_name
    
    ~/workdir$ ll
    -rw-rw-r-- 1 xxx xxx 232917709 Mar 15 14:02 my_file_name.gz
    
    

    After this, decompressing with v1.2.7 produces the same error:

    ~/zlib/$ ./minigzip -d ../workdir/my_file_name.gz
    ./minigzip: ../workdir/my_file_name.gz: invalid distances set
    

    This shows that the data is not corrupted: the file was freshly written by v1.2.7 itself, and v1.2.7 still cannot decompress it.

    Note: Some of the outputs were edited.
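
The reproduction above compares the two .gz files by size only. A stricter comparison, sketched here on the assumption that a copy of the original .gz was kept aside under a different name, is to compare digests as well. The helper names and any paths passed to them are hypothetical. Two files can also hold the same compressed data while differing in gzip header fields, so a mismatch on its own does not prove that the compressed streams differ.

```python
import hashlib
import os

def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()

def same_content(path_a, path_b):
    """True if both files have the same size and the same SHA-256 digest."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)
```

If `same_content` returns True for the original file and the one recompressed with v1.2.7, the failing file is byte-for-byte what v1.2.7 produces from that input.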