Search code examples
pythongzippython-2.xbzip2

How do the compression codecs work in Python?


I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.

My code looks like this:

log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))

However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?

I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?


EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.


Solution

  • As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.

    The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.

    import bz2
    
    class BZ2StreamEncoder(object):
        def __init__(self, filename, mode):
            self.log_file = open(filename, mode)
            self.encoder = bz2.BZ2Compressor()
    
        def write(self, data):
            self.log_file.write(self.encoder.compress(data))
    
        def flush(self):
            self.log_file.write(self.encoder.flush())
            self.log_file.flush()
    
        def close(self):
            self.flush()
            self.log_file.close()
    
    log_file = BZ2StreamEncoder(archive_file, 'ab')
    

    A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.