Tags: python, python-3.x, archive, zstd

Stream a .zst compressed file line by line


I am trying to sift through a big database that is compressed in a .zst file. I am aware that I can simply decompress it and then work on the resulting file, but that uses up a lot of space on my SSD and takes 2+ hours, so I would like to avoid that if possible.

Often when I work with large files, I stream them line by line with code like:

with open(filename) as f:
    for line in f.readlines():
        do_something(line)

I know gzip has this:

with gzip.open(filename,'rt') as f:
    for line in f:
        do_something(line)

but it doesn't seem to work with .zst, so I am wondering if there are any libraries that can decompress and stream the decompressed data in a similar way. For example:

with zstlib.open(filename) as f:
    for line in f.zstreadlines():
        do_something(line)

Solution

  • Knowing which package to use, and what the corresponding docs are, can be a bit confusing, as there appear to be several Python bindings to the actual Zstandard library.

    Below, I am referring to the library by Gregory Szorc, which I installed from conda's default channel with:

    conda install zstd
    
    # check:
    
    conda list zstd
    # # Name                    Version                   Build  Channel
    # zstd                      1.5.5                hc292b87_0  
    

    (even though the docs say to install with pip, which I avoid unless there is no other way, as I like my conda environments to remain usable).

    I am only inferring that this version is the one from G. Szorc, based on the comments in its __init__.py file:

    # Copyright (c) 2017-present, Gregory Szorc
    # All rights reserved.
    #
    # This software may be modified and distributed under the terms
    # of the BSD license. See the LICENSE file for details.
    
    """Python interface to the Zstandard (zstd) compression library."""
    
    from __future__ import absolute_import, unicode_literals
    
    # This module serves 2 roles:
    #
    # 1) Export the C or CFFI "backend" through a central module.
    # 2) Implement additional functionality built on top of C or CFFI backend.
    

    Thus, I think that the corresponding documentation is the python-zstandard documentation.

    In any case, a quick test after the install:

    import zstandard as zstd
    
    with zstd.open('test.zstd', 'w') as f:
        for i in range(10_000):
            f.write(f'foo {i} bar\n')
    
    with zstd.open('test.zstd', 'r') as f:
        for i, line in enumerate(f):
            if i % 1000 == 0:
                print(f'line {i:4d}: {line}', end='')
    

    Produces:

    line    0: foo 0 bar
    line 1000: foo 1000 bar
    line 2000: foo 2000 bar
    line 3000: foo 3000 bar
    line 4000: foo 4000 bar
    line 5000: foo 5000 bar
    line 6000: foo 6000 bar
    line 7000: foo 7000 bar
    line 8000: foo 8000 bar
    line 9000: foo 9000 bar
    

    Notes:

    1. If the file was written in binary (not text) mode, then use mode='rb', the same as for a regular file. The underlying file is always written in binary mode, but if we use a text mode with open, then, according to open's docs, it returns "(...) an io.TextIOWrapper if opened for reading or writing in text mode".
    2. Notice that I use the iterator of f, not readlines(). From the inline docstring, readlines() appears to return a list of lines from the file, i.e. the whole thing is slurped into memory. With the iterator, it is more likely that only portions of the file are in memory at any moment (in zstd's buffer).
    3. Reading this part of the docs, however, I am less sure of the above. Stay tuned... (Edit: tested empirically; it holds, see below.)

    Addendum

    About notes 2 and 3 above: I tested empirically by changing the number of lines to 100 million and comparing the memory usage of the two versions (using htop):

    Streaming version

    with zstd.open('test.zstd', 'r') as f:
        for i, line in enumerate(f):
            if i % 10_000_000 == 0:
                print(f'line {i:8d}: {line}', end='')
    

    No bump in memory usage.

    Readlines version

    with zstd.open('test.zstd', 'r') as f:
        for i, line in enumerate(f.readlines()):
            if i % 10_000_000 == 0:
                print(f'line {i:8d}: {line}', end='')
    

    Memory usage bumped up by a few GB.

    This may be specific to the version installed (1.5.5).
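The iterator-vs-readlines() difference is not specific to zstd, and it can be reproduced with the standard library alone. This sketch (my addition; the file name and line count are arbitrary) measures the peak Python-level allocation of each strategy on a plain text file with tracemalloc:

```python
import os
import tempfile
import tracemalloc

# Build a throwaway text file for the comparison.
path = os.path.join(tempfile.gettempdir(), 'lines_demo.txt')
with open(path, 'w') as f:
    for i in range(200_000):
        f.write(f'foo {i} bar\n')

def count_lines(use_readlines):
    """Count lines; return (count, peak traced allocation in KiB)."""
    tracemalloc.start()
    with open(path) as f:
        # readlines() materializes every line in a list up front;
        # iterating f holds only one line (plus a read buffer) at a time.
        source = f.readlines() if use_readlines else f
        count = sum(1 for _ in source)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, peak // 1024

count_iter, peak_iter = count_lines(False)
count_slurp, peak_slurp = count_lines(True)
os.remove(path)

assert count_iter == count_slurp == 200_000
print(peak_iter, peak_slurp)  # readlines() peaks far higher
```

tracemalloc only sees Python-heap allocations, not the decompressor's internal C buffers, but for the list-of-lines question that is exactly the allocation that matters.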