When I run the following code on Python 3, filecmp.cmp() returns False. Why? I thought compressing the same file twice would produce two files with exactly the same content.
import filecmp
import shutil
import gzip

with open('base_file.fastq', 'rb') as f_in:
    with gzip.open('compressed_one.fastq.gz', "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

with open('base_file.fastq', 'rb') as f_in:
    with gzip.open('compressed_two.fastq.gz', "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

filecmp.cmp('compressed_one.fastq.gz', 'compressed_two.fastq.gz', shallow=False)
A gzipped file is a structured file, not just a run of compressed bytes. It has a header, the compressed content, and a footer. When you create a gzip file, the header and footer are written around the compressed content.
In your case, you are compressing the same file twice with the same compression code and level, so the compressed content will be the same. But the gzip header is going to be different. The header stores a filename and a modification time, and per the documentation Python lets you set both through the filename and mtime arguments of gzip.GzipFile. If these are not specified, defaults are used: the name of the output file (here compressed_one.fastq versus compressed_two.fastq) and the current time.
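To see the difference for yourself, you could read the header fields back out of each file. This is only a rough sketch: read_gzip_header is a name I made up, and it assumes a single-member gzip file and looks only at the fixed 10-byte header plus the optional name field.

```python
import struct


def read_gzip_header(path):
    """Return the mtime and stored name from a gzip file's header (hypothetical helper)."""
    with open(path, 'rb') as f:
        # Fixed part: magic (2 bytes), method, flags, mtime (4 bytes), xfl, os.
        magic, method, flags, mtime, xfl, os_id = struct.unpack('<2sBBIBB', f.read(10))
        if magic != b'\x1f\x8b':
            raise ValueError('not a gzip file')
        if flags & 0x04:  # FEXTRA: skip the optional extra field
            (xlen,) = struct.unpack('<H', f.read(2))
            f.read(xlen)
        name = None
        if flags & 0x08:  # FNAME: zero-terminated original file name
            chars = bytearray()
            while True:
                b = f.read(1)
                if b in (b'', b'\x00'):
                    break
                chars += b
            name = chars.decode('latin-1')
    return {'mtime': mtime, 'name': name}
```

Calling read_gzip_header on compressed_one.fastq.gz and compressed_two.fastq.gz should show different stored names, and different mtime values whenever the two files were not written within the same second.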
So every time you compress the same file, the compressed content stays the same but the header is different. The bytes of the two files are therefore not identical, and filecmp.cmp returns False. If you want to make the output files the same, you can use:
gzip.GzipFile(filename=None, mode=None, compresslevel=9, fileobj=None, mtime=None)
to compress the contents, passing the same filename and a fixed mtime for both files so that the header information is identical.
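A minimal sketch of that approach follows. The helper name gzip_deterministic is my own, and the choice of an empty filename and mtime=0 is just one way to pin the header; any values work as long as they are the same for both files.

```python
import gzip
import shutil


def gzip_deterministic(src_path, dst_path):
    """Compress src_path to dst_path with a fixed gzip header (hypothetical helper)."""
    with open(src_path, 'rb') as f_in, open(dst_path, 'wb') as raw_out:
        # filename='' keeps the output file's name out of the header;
        # mtime=0 pins the timestamp instead of using the current time.
        with gzip.GzipFile(filename='', mode='wb', fileobj=raw_out, mtime=0) as f_out:
            shutil.copyfileobj(f_in, f_out)
```

If you call gzip_deterministic('base_file.fastq', 'compressed_one.fastq.gz') and then the same with 'compressed_two.fastq.gz', filecmp.cmp on the two outputs with shallow=False should return True, since the header, the compressed content, and the footer now all match.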