unix, concurrency, streaming, gzip

Random access to gzipped files?


I have a very large file compressed with gzip sitting on disk. The production environment is "Cloud"-based, so the storage performance is terrible, but CPU is fine. Previously, our data processing pipeline began with gzip -dc streaming the data off the disk.

Now, in order to parallelise the work, I want to run multiple pipelines that each take a pair of byte offsets - start and end - and process that chunk of the file. With a plain file this could be achieved with head and tail, but I'm not sure how to do it efficiently with a compressed file: if I gzip -dc and pipe into head, the offset pairs towards the end of the file mean wastefully decompressing and discarding everything before them.
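To make this concrete, the plain-file version and the naive gzip version look something like the following (GNU coreutils syntax; START and END are 0-based byte offsets for one worker's chunk, and data.txt / data.txt.gz are stand-in file names, not my actual data):

    # One worker's chunk, as 0-based, end-exclusive byte offsets (stand-in values).
    START=1000000000
    END=2000000000

    # Plain file: tail can seek straight to the offset, so only the chunk is read.
    tail -c +"$((START + 1))" data.txt | head -c "$((END - START))" > chunk

    # Gzipped file: every byte before START must still be decompressed and thrown
    # away, so workers handling later chunks repeat almost all of the work.
    gzip -dc data.txt.gz | tail -c +"$((START + 1))" | head -c "$((END - START))" > chunk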

So my question is really about the gzip format - is it theoretically possible to seek to a byte offset in the underlying uncompressed data, or get an arbitrary chunk of it, without decompressing the entire file up to that point? If not, how else might I efficiently partition a file for "random" access by multiple processes while minimising the I/O overhead?


Solution

  • You can't do that with gzip, but you can do it with bzip2, which is block-based rather than stream-based - this is how Hadoop splits huge files and parallelizes their reading across different mappers in MapReduce. Perhaps it would make sense to re-compress your files as bz2 so you can take advantage of this; it would be easier than some ad-hoc way to chunk up the files (see the sketch at the end of this answer).

    I found the patches that implement this in Hadoop here: https://issues.apache.org/jira/browse/HADOOP-4012

    Here's another post on the topic: BZip2 file read in Hadoop

    Perhaps browsing the Hadoop source code would give you an idea of how to read bzip2 files by blocks.
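    As a rough sketch of the re-compression idea - assuming the standard bzip2 tools are installed, and using big.gz / big.bz2 as stand-in file names - each bzip2 block is self-contained, and bzip2recover (normally used for damaged archives) happens to split a file into one standalone .bz2 per block:

        # Re-compress the gzip data as bzip2; -9 selects the largest (900k) block size.
        gzip -dc big.gz | bzip2 -9 > big.bz2

        # Split the archive into its blocks. bzip2recover writes one file per block,
        # with names along the lines of rec00001big.bz2, rec00002big.bz2, ...
        bzip2recover big.bz2

        # Any single block can now be decompressed independently by a worker:
        bzip2 -dc rec00001big.bz2 | wc -l

    Hadoop itself doesn't physically split the file; as I understand it, a mapper handed an arbitrary start offset scans forward for the next bzip2 block header and begins decompressing there, which is roughly what the HADOOP-4012 patches add.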