I have an app that seeks within a .gz file-like object. Python's gzip.GzipFile supports this, but very inefficiently: when the GzipFile object is asked to seek backward, it rewinds to the beginning of the stream (seek(0)) and then reads and decompresses everything up to the desired offset. Needless to say, this absolutely kills performance when seeking around a large .tar.gz file (tens of gigabytes).
So I'm looking to implement checkpointing: store the stream state every now and then, and when asked to seek backward, go back only to the nearest stored checkpoint instead of rewinding all the way to the beginning.
My question is about the gzip / zlib implementation: what does the "current decompressor state" consist of? Where is it stored? How big is it? And how do I copy that state out of an open GzipFile object, and later assign it back for the backward seek?
Note that I have no control over the input .gz files. The solution must be strictly for GzipFile in read-only rb mode.
EDIT: Looking at CPython's source, this is the relevant code flow and the data structures involved, ordered from the top level (Python) down to raw C:
gzip._GzipReader.seek() == DecompressReader.seek() <=== NEED TO CHANGE THIS
ZlibDecompressor state + its deepcopy <=== NEED TO COPY / RESTORE THIS
EDIT2: Also found this teaser in zlib:
An access point can be created at the start of any deflate block, by saving the starting file offset and bit of that block, and the 32K bytes of uncompressed data that precede that block. Also the uncompressed offset of that block is saved to provide a reference for locating a desired starting point in the uncompressed stream.
Another way to build an index would be to use inflateCopy(). That would not be constrained to have access points at block boundaries, but requires more memory per access point, and also cannot be saved to file due to the use of pointers in the state.
(They call "access points" what I call "checkpoints"; same thing.)
This pretty much answers all my questions, but I still need to find a way to translate the zran.c example to work with the gzip/zlib scaffolding in CPython.
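As a side note on the inflateCopy() route mentioned above: its Python-level counterpart is zlib.Decompress.copy(), which snapshots the full inflate state, including the history window. A minimal stdlib-only sketch of the checkpoint/restore idea (not tied to GzipFile; the variable names are mine):

```python
import gzip
import zlib

# Sketch: checkpoint a gzip stream with zlib.Decompress.copy(),
# the Python-level counterpart of zlib's inflateCopy().
payload = bytes(range(256)) * 4096        # ~1 MiB of test data
raw = gzip.compress(payload)
mid = len(raw) // 2

d = zlib.decompressobj(wbits=31)          # wbits=31: expect a gzip wrapper
first_half = d.decompress(raw[:mid])

checkpoint = d.copy()                     # snapshot the full inflate state

# Read past the checkpoint...
rest = d.decompress(raw[mid:]) + d.flush()

# ...then "seek back" by resuming from the snapshot instead of from byte 0:
rest_again = checkpoint.decompress(raw[mid:]) + checkpoint.flush()

assert first_half + rest == payload
assert rest == rest_again
```

This demonstrates only the state copy/restore; a real implementation would also have to record the compressed and uncompressed offsets alongside each snapshot, which is exactly what the zran.c access-point index does.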
You could try the indexed_gzip library, which builds on top of zlib's zran.c example. Essentially, it keeps a series of checkpoints throughout the file, and when a request for a specific byte offset arrives, it starts decompressing from the nearest preceding checkpoint. (indexed_gzip calls these "index seek points.")
Example usage from the documentation:
import indexed_gzip as igzip
# You can create an IndexedGzipFile instance by specifying a file name.
myfile = igzip.IndexedGzipFile('big_file.gz')
# Or by passing an open file handle. In this case, the file handle
# must be opened in read-only binary mode:
myfile = igzip.IndexedGzipFile(fileobj=fileobj, auto_build=True, spacing=1024**2)
# Write support is currently non-existent.
The auto_build mode (True by default) enables incremental index building: each call to seek(offset) expands the checkpoint index to cover the requested offset, in case it's not already covered.
You'll probably want to tune the spacing parameter, which controls how many bytes of uncompressed file content lie between consecutive checkpoints. This is a time-memory tradeoff: more checkpoints mean less work on each seek, but also more memory used to store them. It defaults to 1 MB of uncompressed data.
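To get a feel for the memory side of that tradeoff: a zran-style access point stores roughly the 32K of uncompressed history preceding it (per the zlib comment quoted in the question), so the index size scales as roughly file_size / spacing * 32 KiB. A quick back-of-the-envelope sketch, assuming that per-point figure (the function name is mine):

```python
# Rough memory estimate for a zran-style checkpoint index.
# Assumes ~32 KiB of stored history window per access point;
# real per-point bookkeeping adds a little on top.
def index_size_bytes(uncompressed_size, spacing, window=32 * 1024):
    points = uncompressed_size // spacing + 1
    return points * window

GIB = 1024 ** 3
# 50 GiB of uncompressed data with the default 1 MiB spacing:
print(index_size_bytes(50 * GIB, 1024 ** 2) / GIB)  # ~1.56 GiB of index
```

In other words, for very large files you may want a spacing well above the default, at the cost of slower individual seeks.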
For faster startup, you can write the index out to disk (the index is much smaller than the underlying compressed file) and load the index the next time your program runs. See Index import/export for more about how to use this feature.