
Python LZMA : Compressed data ended before the end-of-stream marker was reached


I am using Python's built-in lzma module to decode compressed chunks of data. Depending on the chunk, I get the following exception:

Compressed data ended before the end-of-stream marker was reached

The data is NOT corrupted. It can be decompressed correctly with other tools, so it must be a bug in the library. Other people have reported the same issue.

Unfortunately, none of them seems to have found a solution yet, at least not one that works on Python 3.5.

How can I solve this problem? Is there a workaround?


Solution

  • I spent a lot of time trying to understand and solve this problem, so I thought it would be a good idea to share what I found. The problem seems to be caused by a chunk of data that lacks a properly set end-of-stream (EOF) marker. To decompress a buffer, I used to call lzma.decompress from the standard lzma module. However, that function expects each data buffer to contain the end-of-stream marker; otherwise it raises an LZMAError exception.

    To work around this limitation, we can implement an alternative decompress function that uses an LZMADecompressor object to extract the data from a buffer. For example:

    from lzma import FORMAT_AUTO, LZMADecompressor, LZMAError

    def decompress_lzma(data):
        results = []
        while True:
            # Use a fresh decompressor for each concatenated stream in the buffer.
            decomp = LZMADecompressor(FORMAT_AUTO, None, None)
            try:
                res = decomp.decompress(data)
            except LZMAError:
                if results:
                    break  # Leftover data is not a valid LZMA/XZ stream; ignore it.
                else:
                    raise  # Error on the first iteration; bail out.
            results.append(res)
            data = decomp.unused_data
            if not data:
                break  # Whole buffer consumed; accept it even without an end-of-stream marker.
            if not decomp.eof:
                raise LZMAError("Compressed data ended before the end-of-stream marker was reached")
        return b"".join(results)
    

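    This function is similar to the one provided by the standard lzma module, with one key difference: the loop exits as soon as the entire buffer has been consumed, before checking whether the end-of-stream marker was reached. A buffer that ends early therefore returns whatever data could be decoded instead of raising an exception.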

    I hope this can be useful to other people.
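
To see the difference in behaviour, here is a minimal sketch that simulates the situation by cutting the tail off a valid .xz stream. The helper name decompress_partial, the sample payload, and the 20-byte truncation are assumptions made up for this illustration; only lzma.compress, lzma.decompress, lzma.LZMADecompressor, its eof attribute, and lzma.LZMAError come from the standard library.

```python
import lzma


def decompress_partial(data):
    """Hypothetical helper: decode a single stream and report completeness.

    Returns (decoded_bytes, complete). complete is False when the input
    ended before the end-of-stream marker was reached.
    """
    decomp = lzma.LZMADecompressor(lzma.FORMAT_AUTO)
    out = decomp.decompress(data)
    return out, decomp.eof


payload = b"hello lzma " * 1000
blob = lzma.compress(payload)   # a complete .xz stream
truncated = blob[:-20]          # simulate a chunk that ends before the end of the stream

# The strict one-shot API rejects the truncated input.
try:
    lzma.decompress(truncated)
    strict_failed = False
except lzma.LZMAError:
    strict_failed = True

# The decompressor object returns what it could decode and flags the gap.
full, full_ok = decompress_partial(blob)
part, part_ok = decompress_partial(truncated)

print("strict API failed:", strict_failed)
print("complete stream ok:", full_ok, "truncated stream ok:", part_ok)
print("bytes recovered from truncated input:", len(part))
```

Checking the eof flag yourself lets the caller decide whether an incomplete stream is acceptable, rather than silently treating partial output as complete.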