Python bz2 sequential compressor produces invalid data stream on low compression levels


I have a series of strings in a list named 'lines' and I compress them as follows:

import bz2
compressor = bz2.BZ2Compressor(compressionLevel)
for l in lines:
    compressor.compress(l)
compressedData = compressor.flush()
decompressedData = bz2.decompress(compressedData)

When compressionLevel is set to 8 or 9, this works fine. When it's any number between 1 and 7 (inclusive), the final line fails with an IOError: invalid data stream. The same occurs if I use the sequential decompressor. However, if I join the strings into one long string and use the one-shot compressor function, it works fine:

import bz2
compressedData = bz2.compress("\n".join(lines))
decompressedData = bz2.decompress(compressedData)
# Works perfectly

Do you know why this happens, and how to make it work at lower compression levels?


Solution

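  • You are throwing away the compressed data returned by compressor.compress(l). The docs say it "Returns a chunk of compressed data if possible, or an empty byte string otherwise." If you keep only the result of flush(), every chunk emitted earlier is lost and the stream is missing its beginning. It most likely works at levels 8 and 9 only because the bzip2 block size grows with the level (roughly level x 100 kB), so your input probably fits in one block and compress() returns nothing until flush(). You need to do something like this: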

    # setup code goes here
    for l in lines:
        chunk = compressor.compress(l)
        if chunk: do_something_with(chunk)
    chunk = compressor.flush()
    if chunk: do_something_with(chunk)
    # teardown code goes here
    

    Also note that your one-shot code uses "\n".join(), which adds newlines that the chunked version never sees. To check it against the chunked result, use "".join() instead.

    Also beware of bytes/str issues: on Python 3 the bz2 compressor accepts bytes, not str, so the above should be b"whatever".join().

    What version of Python are you using?
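
For reference, here is a minimal runnable sketch of the approach described above, assuming Python 3 and a list of bytes objects. The helper name compress_chunks and the sample data are hypothetical and only illustrate the idea: collect every chunk that compress() returns, append the result of flush(), and join them.

```python
import bz2


def compress_chunks(lines, level=1):
    """Compress an iterable of bytes objects incrementally.

    Every chunk returned by compress() is kept; discarding them is
    what produces the "invalid data stream" error in the question.
    """
    compressor = bz2.BZ2Compressor(level)
    chunks = []
    for line in lines:
        chunk = compressor.compress(line)
        if chunk:  # may be empty while the compressor is still buffering
            chunks.append(chunk)
    chunks.append(compressor.flush())  # remaining buffered data and stream end
    return b"".join(chunks)


# Hypothetical sample input (about 1 MB), large enough to span several
# blocks at the lowest compression level.
lines = [("line %d of sample text\n" % i).encode("ascii") for i in range(50000)]

compressed = compress_chunks(lines, level=1)
restored = bz2.decompress(compressed)
print(restored == b"".join(lines))
```

Note that the comparison uses b"".join(lines), not b"\n".join(lines), because the incremental version compresses the items exactly as they are.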