The zstd compressor can operate in streaming mode, or the total size to be compressed can be given in advance (for example, with the `size` parameter in this Python binding). How does the library behave when the size is given in advance? Is it faster? Does it use less memory, or compress more effectively? And what happens if you compress more or less data than the given size?
I tested the python-zstandard library against the `dickens` text from the Silesia corpus.
Compression takes about the same amount of time whether or not the size is known in advance. The compressor produces the same number of bytes, plus a 3-byte header, for this 10MB file.
If you tell the compressor the wrong number of bytes, it simply raises an exception when it is given more or less input than promised.
If the size was not known at compression time, you have to use the streaming decompression API instead of the one-shot `.decompress(bytes)` API, though I could be missing a flush frame / close frame command.
I chose level 22 so that the memory differences would be more apparent. At more reasonable levels (<= 19), memory usage is under 100MB on compression and under 20MB on decompression, which demonstrates why the command-line tool guards the extreme compression levels behind a flag.
According to the scalene profiler, at level 22:

| peak memory | function |
|---|---|
| 267MB | oneshot |
| 777MB | onestream |
| 266MB | rightsize |
| 774MB | multistream |
| decompression peak memory | function |
|---|---|
| 9.9MB | one-shot decompression |
| 128.5MB | streaming decompression, size unknown |
| 19.3MB | streaming decompression, size known |
| (fails) | one-shot decompression, size unknown |
"""
Test zstd with different options and data sizes.
"""
import pathlib
import zstandard
import time
import io
import contextlib
@contextlib.contextmanager
def timeme():
start = time.monotonic()
yield
end = time.monotonic()
print(f"{end-start}s")
# The Collected works of Charles Dickens from the Silesia corpus
uncompressed = pathlib.Path("dickens").read_bytes()
ZSTD_COMPRESS_LEVEL = 22
def oneshot():
compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
with timeme():
result = compressor.compress(uncompressed)
print("One-shot", len(result))
return result
def onestream():
compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
with timeme():
bio = io.BytesIO()
with compressor.stream_writer(bio, closefd=False) as writer:
writer.write(uncompressed)
writer.close()
print("One-stream", len(bio.getvalue()))
return bio.getvalue()
def rightsize():
compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
with timeme():
bio = io.BytesIO()
with compressor.stream_writer(
bio, closefd=False, size=len(uncompressed)
) as writer:
writer.write(uncompressed)
writer.close()
print("Right-size", len(bio.getvalue()))
return bio.getvalue()
def multistream():
compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
with timeme():
bio = io.BytesIO()
with compressor.stream_writer(bio, closefd=False) as writer:
CHUNK = len(uncompressed) // 10
for i in range(0, len(uncompressed), CHUNK):
writer.write(uncompressed[i : i + CHUNK])
writer.close()
print("Chunked stream", len(bio.getvalue()))
return bio.getvalue()
def wrongsize():
# This one's easy - you get an exception
compressor = zstandard.ZstdCompressor(level=ZSTD_COMPRESS_LEVEL)
with timeme():
bio = io.BytesIO()
with compressor.stream_writer(
bio, size=len(uncompressed) + 100, closefd=False
) as writer:
writer.write(uncompressed)
writer.close()
print("Wrong-size", len(bio.getvalue()))
has_size = oneshot()
no_size = onestream()
rightsize()
multistream()
oneshot()
def d1():
decompress = zstandard.ZstdDecompressor()
assert uncompressed == decompress.decompress(has_size)
d1()
def d2():
# the decompress.decompress() API errors with zstd.ZstdError: could not
# determine content size in frame header
decompress = zstandard.ZstdDecompressor().stream_reader(no_size)
assert uncompressed == decompress.read()
d2()
def d3():
# streaming decompression with sized input
decompress = zstandard.ZstdDecompressor().stream_reader(has_size)
assert uncompressed == decompress.read()
d3()