python-3.x gzip file-comparison

Comparing compressed files with the gzip and filecmp modules returns False on Python 3.10


When I run the following code on Python 3, filecmp.cmp() returns False. Why is that? I thought compressing the same file twice would produce two files with exactly the same content.

import filecmp
import shutil
import gzip

# Compress the same input file into two separate gzip archives.
with open('base_file.fastq', 'rb') as f_in:
    with gzip.open('compressed_one.fastq.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

with open('base_file.fastq', 'rb') as f_in:
    with gzip.open('compressed_two.fastq.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Byte-by-byte comparison of the two archives.
print(filecmp.cmp('compressed_one.fastq.gz', 'compressed_two.fastq.gz', shallow=False))

Solution

  • A gzipped file is a structured file rather than just a stream of compressed bytes. It has a header, the compressed content, and a footer (a CRC-32 checksum and the uncompressed size). When you create a gzip file, the header and footer are added around the compressed content.

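    In your case, you are compressing the same file twice with the same compression method and level, so the compressed content will be the same. But the gzip header is going to be different. As per the documentation, Python lets you set the header values for the original filename and the modification time (the filename and mtime arguments). If these are not specified, defaults are used: the filename is taken from the name of the output file, and mtime is the current time. Here the two output files have different names, so the headers differ even if both files are written within the same second.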

    In your case, every time you compress the same file, the compressed content stays the same but the header differs. The bytes of the two files therefore differ, and filecmp.cmp() returns False. If you want the output files to be identical, you can use:

    gzip.GzipFile(filename=None, mode=None, compresslevel=9, fileobj=None, mtime=None)

    to compress the contents, passing the same fixed filename and mtime values for both files so that the header information is identical.
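
As a minimal sketch of that approach: the helper name gzip_reproducibly, the temporary directory, and the sample FASTQ-like bytes below are made up for illustration and are not part of the original question. The empty filename and mtime=0 are just one possible choice of fixed header values; any values work as long as both files use the same ones.

```python
import filecmp
import gzip
import os
import shutil
import tempfile


def gzip_reproducibly(src_path, dst_path, compresslevel=9):
    """Compress src_path into dst_path with fixed gzip header fields.

    Hypothetical helper, not part of the gzip module.
    """
    with open(src_path, "rb") as f_in, open(dst_path, "wb") as raw_out:
        # filename="" keeps the output file's name out of the header;
        # mtime=0 writes a constant instead of the current time.
        with gzip.GzipFile(filename="", mode="wb",
                           compresslevel=compresslevel,
                           fileobj=raw_out, mtime=0) as f_out:
            shutil.copyfileobj(f_in, f_out)


# Build a small stand-in for base_file.fastq (sample data is an assumption).
workdir = tempfile.mkdtemp()
base = os.path.join(workdir, "base_file.fastq")
with open(base, "wb") as f:
    f.write(b"@read1\nACGT\n+\nIIII\n" * 100)

one = os.path.join(workdir, "compressed_one.fastq.gz")
two = os.path.join(workdir, "compressed_two.fastq.gz")
gzip_reproducibly(base, one)
gzip_reproducibly(base, two)

# Byte-by-byte comparison of the two archives.
identical = filecmp.cmp(one, two, shallow=False)
print(identical)
```

With the header fields pinned this way, the two archives should compare as identical even though their file names differ. If changing how the files are written is not an option, comparing the decompressed contents instead of the archives avoids the header issue altogether.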