
How does zstandard compression behave when passed a size hint, instead of compressing a stream?


The zstd compressor can operate in streaming mode, or the total size to be compressed can be given in advance (for example, with the size parameter in this Python binding).

How does the library behave when the size is given in advance? Is it faster, does it use less memory, or does it compress more effectively? What happens when you compress more or less data than the given size?


Solution

  • I tested the python-zstandard library against the Silesia Corpus' dickens text.

    Compression takes about the same amount of time whether the size is known or unknown. The compressor produces the same number of bytes, plus a 3-byte header, for this 10MB file.

    If you tell the compressor the wrong number of bytes, it just fails when it is given more or less input than expected.

    If the size was not known on compression, you have to use the streaming decompression API instead of the one-shot .decompress(bytes) API, but I could be missing a flush frame / close frame command.
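
    To see whether a given frame declares its content size at all, the frame header can be inspected directly. The following is a minimal sketch using only the standard library, based on my reading of the zstd frame format (RFC 8878); the function name frame_content_size is my own and is not part of python-zstandard. Under that layout a single-segment frame with a 4-byte size field has a 9-byte header, while a frame with no size field and a window descriptor has a 6-byte header, which would be consistent with the 3-byte difference above, though I have not verified that these are exactly the headers the library emits.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528  # stored little-endian at the start of every zstd frame


def frame_content_size(frame: bytes):
    """Return the content size declared in a zstd frame header, or None if absent.

    Hypothetical helper for illustration; it only reads the header and does
    not validate the rest of the frame.
    """
    if len(frame) < 5:
        raise ValueError("too short to be a zstd frame")
    (magic,) = struct.unpack_from("<I", frame, 0)
    if magic != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")

    descriptor = frame[4]
    fcs_flag = descriptor >> 6               # bits 7-6: Frame_Content_Size_Flag
    single_segment = (descriptor >> 5) & 1   # bit 5: Single_Segment_Flag
    dict_id_flag = descriptor & 3            # bits 1-0: Dictionary_ID_Flag

    pos = 5
    if not single_segment:
        pos += 1                             # skip the Window_Descriptor byte
    pos += (0, 1, 2, 4)[dict_id_flag]        # skip the optional Dictionary_ID

    if fcs_flag == 0:
        # No size field at all unless the frame is single-segment (1-byte size).
        return frame[pos] if single_segment else None
    if fcs_flag == 1:
        return struct.unpack_from("<H", frame, pos)[0] + 256  # 2-byte field is offset by 256
    if fcs_flag == 2:
        return struct.unpack_from("<I", frame, pos)[0]
    return struct.unpack_from("<Q", frame, pos)[0]
```

    Calling it on the output of a compressor run would be expected to give an integer for a sized frame and None for a frame written without a size hint.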

    I chose level 22 so that the memory differences would be more apparent. At more reasonable levels <= 19, memory usage is < 100MB on compression and < 20MB on decompression, which demonstrates why the command line tool guards extreme compression levels with a flag.

    According to the scalene profiler, at level 22,

    peak memory                 function
    267MB                       oneshot
    777MB                       onestream
    266MB                       rightsize
    774MB                       multistream

    decompression peak memory   function
    9.9MB                       one-shot decompression
    128.5MB                     streaming decompression, size unknown
    19.3MB                      streaming decompression, size known
    (fails)                     one-shot decompression, size unknown

    """
    Test zstd with different options and data sizes.
    """
    
    import pathlib
    import zstandard
    import time
    import io
    import contextlib
    
    
    @contextlib.contextmanager
    def timeme():
        start = time.monotonic()
        yield
        end = time.monotonic()
        print(f"{end-start}s")
    
    
    # The Collected works of Charles Dickens from the Silesia corpus
    uncompressed = pathlib.Path("dickens").read_bytes()
    
    ZSTD_COMPRESS_LEVEL = 22
    
    
    def oneshot():
        compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
        with timeme():
            result = compressor.compress(uncompressed)
            print("One-shot", len(result))
            return result
    
    
    def onestream():
        compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
        with timeme():
            bio = io.BytesIO()
            with compressor.stream_writer(bio, closefd=False) as writer:
                writer.write(uncompressed)
                writer.close()
            print("One-stream", len(bio.getvalue()))
            return bio.getvalue()
    
    
    def rightsize():
        compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
        with timeme():
            bio = io.BytesIO()
            with compressor.stream_writer(
                bio, closefd=False, size=len(uncompressed)
            ) as writer:
                writer.write(uncompressed)
                writer.close()
            print("Right-size", len(bio.getvalue()))
            return bio.getvalue()
    
    
    def multistream():
        compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
        with timeme():
            bio = io.BytesIO()
            with compressor.stream_writer(bio, closefd=False) as writer:
                CHUNK = len(uncompressed) // 10
                for i in range(0, len(uncompressed), CHUNK):
                    writer.write(uncompressed[i : i + CHUNK])
                writer.close()
            print("Chunked stream", len(bio.getvalue()))
            return bio.getvalue()
    
    
    def wrongsize():
        # This one's easy - you get an exception
        compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
        with timeme():
            bio = io.BytesIO()
            with compressor.stream_writer(
                bio, size=len(uncompressed) + 100, closefd=False
            ) as writer:
                writer.write(uncompressed)
                writer.close()
    
            print("Wrong-size", len(bio.getvalue()))
    
    
    has_size = oneshot()
    
    no_size = onestream()
    
    rightsize()
    
    multistream()
    
    oneshot()
    
    
    def d1():
        decompress = zstandard.ZstdDecompressor()
        assert uncompressed == decompress.decompress(has_size)
    
    
    d1()
    
    
    def d2():
        # The one-shot decompress.decompress() API fails on this input with
        # "zstd.ZstdError: could not determine content size in frame header",
        # so use the streaming reader instead.
        decompress = zstandard.ZstdDecompressor().stream_reader(no_size)
        assert uncompressed == decompress.read()
    
    
    d2()
    
    
    def d3():
        # streaming decompression with sized input
        decompress = zstandard.ZstdDecompressor().stream_reader(has_size)
        assert uncompressed == decompress.read()
    
    
    d3()