Tags: python, serialization, compression, pickle

What's the most space-efficient way to compress serialized Python data?


From the Python documentation:

By default, the pickle data format uses a relatively compact binary representation. If you need optimal size characteristics, you can efficiently compress pickled data.

I'm going to be serializing several gigabytes of data at the end of a process that runs for several hours, and I'd like the result to be as small as possible on disk. However, Python offers several different ways to compress data.

Is there one of these that's particularly efficient for pickled files? The data I'm pickling mostly consists of nested dictionaries and strings, so if there's a more efficient way to compress e.g. JSON, that would work too.

The time for compression and decompression isn't important, but the time this process takes to generate the data makes trial and error inconvenient.


Solution

  • I've run some tests on a pickled object; lzma gave the best compression.

    But your results can vary based on your data, so I'd recommend testing the algorithms on a sample of your own. (A sketch for squeezing lzma a little further, and for reading the data back, follows the test script below.)

    Mode                LastWriteTime         Length Name
    ----                -------------         ------ ----
    -a----        9/17/2019  10:05 PM       23869925 no_compression.pickle
    -a----        9/17/2019  10:06 PM        6050027 gzip_test.gz
    -a----        9/17/2019  10:06 PM        3083128 bz2_test.pbz2
    -a----        9/17/2019  10:07 PM        1295013 brotli_test.bt
    -a----        9/17/2019  10:06 PM        1077136 lzma_test.xz
    

    Test file used (you'll need to pip install brotli or remove that algorithm):

    import bz2
    import gzip
    import lzma
    import pickle

    import brotli


    class SomeObject:

        # a, b and c are class attributes, so only i is stored per instance.
        a = 'some data'
        b = 123
        c = 'more data'

        def __init__(self, i):
            self.i = i


    # Roughly a million small objects to serialize.
    data = [SomeObject(i) for i in range(1, 1000000)]

    # Baseline: plain pickle, no compression.
    with open('no_compression.pickle', 'wb') as f:
        pickle.dump(data, f)

    # gzip, bz2 and lzma all provide file-like objects, so pickle can
    # write straight into the compressed stream.
    with gzip.open('gzip_test.gz', 'wb') as f:
        pickle.dump(data, f)

    with bz2.BZ2File('bz2_test.pbz2', 'wb') as f:
        pickle.dump(data, f)

    with lzma.open('lzma_test.xz', 'wb') as f:
        pickle.dump(data, f)

    # For brotli, compress the already-pickled bytes directly.
    with open('no_compression.pickle', 'rb') as f:
        pdata = f.read()
        with open('brotli_test.bt', 'wb') as b:
            b.write(brotli.compress(pdata))
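
    Since file size is the only constraint here, two knobs may squeeze things further: lzma's preset (9 combined with lzma.PRESET_EXTREME costs much more CPU time and memory, which you said doesn't matter) and the pickle protocol (pickle.HIGHEST_PROTOCOL keeps the pickle itself as compact as possible before compression). The sketch below shows both, plus how to read the data back; the helper names, file name and sample dict are placeholders, and how much the extreme preset actually gains on your data is something to verify on a small sample first.

    import lzma
    import pickle


    def dump_lzma(obj, path):
        # Hypothetical helper: preset 9 | PRESET_EXTREME trades a lot of CPU
        # time and memory for a potentially smaller file.
        with lzma.open(path, 'wb', preset=9 | lzma.PRESET_EXTREME) as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


    def load_lzma(path):
        # lzma.open decompresses transparently, so pickle.load can read
        # straight from the compressed stream.
        with lzma.open(path, 'rb') as f:
            return pickle.load(f)


    # Placeholder data standing in for the nested dicts and strings.
    sample = {'outer': {'inner': 'some text' * 100, 'n': 42}}
    dump_lzma(sample, 'lzma_extreme_test.xz')
    assert load_lzma('lzma_extreme_test.xz') == sample

    The same load_lzma pattern also reads back the lzma_test.xz file written by the test script above.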