Compress a file in memory, compute checksum and write it as `gzip` in python


I want to compress files and compute the checksum of the compressed file using python. My first naive attempt was to use 2 functions:

import gzip
import hashlib


def compress_file(input_filename, output_filename):
    # The with block closes both files even if an error occurs
    with open(input_filename, 'rb') as f_in, gzip.open(output_filename, 'wb') as f_out:
        f_out.writelines(f_in)


def md5sum(filename):
    # Open in binary mode: the compressed file is not text
    with open(filename, 'rb') as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return md5

However, this writes the compressed file and then reads it back. With many files (> 10 000), each several MB when compressed, on an NFS-mounted drive, it is slow.

How can I compress the file in a buffer and then compute the checksum from this buffer before writing the output file?

The files are not that big, so I can afford to hold everything in memory. However, an incremental version would be nice too.

The last requirement is that it should work with multiprocessing (in order to compress several files in parallel).

I have tried zlib.compress, but the returned string lacks the header of a gzip file.
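For what it's worth, zlib can produce the gzip container itself when a compressor object is created with wbits set to 16 + zlib.MAX_WBITS. The sketch below uses that to compress chunk by chunk while hashing the compressed bytes as they are written, so the output is never re-read. The function name gzip_md5_stream and the chunk_size parameter are made up for illustration. As far as I know, the header zlib writes this way has a zeroed timestamp and no filename, so repeated runs on the same machine should give the same bytes.

```python
import hashlib
import zlib


def gzip_md5_stream(input_filename, output_filename, chunk_size=1 << 20):
    """Compress a file to gzip format chunk by chunk and return the MD5
    hex digest of the compressed bytes (hypothetical helper)."""
    # wbits = 16 + MAX_WBITS asks zlib for a gzip header and trailer
    compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    md5 = hashlib.md5()
    with open(input_filename, "rb") as f_in, open(output_filename, "wb") as f_out:
        while True:
            chunk = f_in.read(chunk_size)
            if not chunk:
                break
            out = compressor.compress(chunk)
            md5.update(out)
            f_out.write(out)
        # flush() emits the remaining compressed data and the gzip trailer
        out = compressor.flush()
        md5.update(out)
        f_out.write(out)
    return md5.hexdigest()
```

Since the function only touches its own local objects, it should be usable as a worker function for a process pool, one call per file.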

Edit: following @abarnert's suggestion, I used Python 3's gzip.compress:

def compress_md5(input_filename, output_filename):
    # Read the input into a buffer
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    # Compress this buffer
    c_buff = gzip.compress(buff)
    # Compute MD5 of the compressed bytes
    md5 = hashlib.md5(c_buff).hexdigest()
    # Write the compressed buffer
    with open(output_filename, 'wb') as f_out:
        f_out.write(c_buff)

    return md5

This produces a correct gzip file, but the output differs on each run (the MD5 is different):

>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'0d0eb6a5f3fe2c1f3201bc3360201f71'
>>> compress_md5('4327_010.pdf', '4327_010.pdf.gz')
'8e4954ab5914a1dd0d8d0deb114640e5'

The gzip program doesn't have this problem:

 $ gzip -c 4327_010.pdf | md5sum
 8965184bc4dace5325c41cc75c5837f1  -
 $ gzip -c 4327_010.pdf | md5sum
 8965184bc4dace5325c41cc75c5837f1  -

I guess it's because the gzip module uses the current time by default when creating a file (the gzip program uses the modification time of the input file, I guess). There is no way to change that with gzip.compress.
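The timestamp guess can be checked directly: in the gzip format the modification time occupies bytes 4 to 7 of the header, stored little-endian. The sketch below reads that field back. It assumes Python 3.8 or later, where gzip.compress accepts an mtime keyword; the helper name gzip_header_mtime is made up for illustration.

```python
import gzip
import struct


def gzip_header_mtime(gz_bytes):
    """Return the MTIME field (header bytes 4-7, little-endian)."""
    return struct.unpack("<I", gz_bytes[4:8])[0]


data = b"example payload " * 100

# Assumption: Python 3.8+, where gzip.compress has an mtime parameter.
first = gzip.compress(data, mtime=0)
second = gzip.compress(data, mtime=0)

print(gzip_header_mtime(first))   # the timestamp stored in the header
print(first == second)            # identical bytes when mtime is pinned
```

With a fixed mtime the two buffers should compare equal, so their checksums match from run to run.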

I was thinking of creating a gzip.GzipFile in read/write mode, controlling the mtime, but there is no such mode for gzip.GzipFile.

Inspired by @zwol's suggestion, I wrote the following function, which correctly sets the filename and the OS (Unix) in the header:

def compress_md5(input_filename, output_filename):
    # Read data in buffer
    with open(input_filename, 'rb') as f_in:
        buff = f_in.read()
    # Create output buffer (Python 2; use io.BytesIO in Python 3)
    c_buff = cStringIO.StringIO()
    # Create gzip file, reusing the input file's modification time
    input_file_stat = os.stat(input_filename)
    mtime = input_file_stat[8]  # index 8 is st_mtime
    gzip_obj = gzip.GzipFile(input_filename, mode="wb", fileobj=c_buff, mtime=mtime)
    # Compress data in memory
    gzip_obj.write(buff)
    gzip_obj.close()
    # Retrieve compressed data
    c_data = c_buff.getvalue()
    # Change OS value (header byte 9) to 3, i.e. Unix
    c_data = c_data[0:9] + '\003' + c_data[10:]
    # Really write compressed data; the with block closes the output file
    with open(output_filename, "wb") as f_out:
        f_out.write(c_data)
    # Compute MD5
    md5 = hashlib.md5(c_data).hexdigest()
    return md5

The output is the same across runs. Moreover, the output of file is the same as for the file produced by gzip:

$ gzip -9 -c 4327_010.pdf > ref_max/4327_010.pdf.gz
$ file ref_max/4327_010.pdf.gz 
ref_max/4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May  5 14:28:16 2015, max compression
$ file 4327_010.pdf.gz 
4327_010.pdf.gz: gzip compressed data, was "4327_010.pdf", from Unix, last modified: Tue May  5 14:28:16 2015, max compression

However, md5 is different:

$ md5sum 4327_010.pdf.gz ref_max/4327_010.pdf.gz 
39dc3e5a52c71a25c53fcbc02e2702d5  4327_010.pdf.gz
213a599a382cd887f3c4f963e1d3dec4  ref_max/4327_010.pdf.gz

gzip -l is also different:

$ gzip -l ref_max/4327_010.pdf.gz 4327_010.pdf.gz 
     compressed        uncompressed  ratio uncompressed_name
        7286404             7600522   4.1% ref_max/4327_010.pdf
        7297310             7600522   4.0% 4327_010.pdf

I guess it's because the gzip program and the python gzip module (which is based on the C library zlib) have a slightly different algorithm.
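One way to test this guess is to compare the headers of the two files field by field. The sketch below parses the fixed 10-byte gzip header plus the optional original filename; read_gzip_header is a hypothetical helper, not part of the gzip module. If every parsed field matches and the checksums still differ, the difference has to be in the compressed stream itself, which would be consistent with the different compressed sizes reported by gzip -l.

```python
import struct

FEXTRA = 0x04  # header flag: an "extra" field follows the fixed header
FNAME = 0x08   # header flag: a NUL-terminated original filename is present


def read_gzip_header(gz_bytes):
    """Parse the fixed gzip header fields and the optional filename."""
    magic, method, flags, mtime, xfl, os_code = struct.unpack(
        "<HBBIBB", gz_bytes[:10])
    if magic != 0x8B1F:  # bytes 0x1f 0x8b read as a little-endian short
        raise ValueError("not a gzip stream")
    pos = 10
    if flags & FEXTRA:
        # Skip the extra field: a 2-byte length followed by that many bytes
        (xlen,) = struct.unpack("<H", gz_bytes[pos:pos + 2])
        pos += 2 + xlen
    name = None
    if flags & FNAME:
        end = gz_bytes.index(b"\x00", pos)
        name = gz_bytes[pos:end].decode("latin-1")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_code, "name": name}
```

Calling it on the first few hundred bytes of each file and comparing the two dictionaries should be enough to see whether the headers agree.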


Solution

  • Wrap a gzip.GzipFile object around an io.BytesIO object. (In Python 2, use cStringIO.StringIO instead.) After you close the GzipFile, you can retrieve the compressed data from the BytesIO object (using getvalue), hash it, and write it out to a real file.

    Incidentally, you really shouldn't be using MD5 at all anymore.
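A minimal Python 3 sketch of the approach described above, under a few assumptions: the function name compress_and_hash is made up, SHA-256 is picked only as one possible replacement for MD5, and the input file's modification time is reused as the gzip timestamp so that repeated runs give the same bytes.

```python
import gzip
import hashlib
import io
import os


def compress_and_hash(input_filename, output_filename):
    """Compress in memory, hash the compressed bytes, then write them out."""
    with open(input_filename, "rb") as f_in:
        data = f_in.read()
    # Assumption: pin the header timestamp to the input file's mtime
    mtime = int(os.stat(input_filename).st_mtime)
    buf = io.BytesIO()
    with gzip.GzipFile(filename=os.path.basename(input_filename), mode="wb",
                       fileobj=buf, mtime=mtime) as gz:
        gz.write(data)
    # The GzipFile is closed here, so the buffer holds a complete gzip stream
    compressed = buf.getvalue()
    with open(output_filename, "wb") as f_out:
        f_out.write(compressed)
    # Assumption: SHA-256 instead of MD5; any hashlib algorithm works the same way
    return hashlib.sha256(compressed).hexdigest()
```

The compressed data is written once and never read back, which is the point of the exercise on a slow network drive.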