python, large-files, bzip2, tarfile

Read large file header (~9GB) inside tarfile without full extraction


I have ~1GB *.tbz files. Inside each of those files there is a single ~9GB file. I just need to read the header of this file, the first 1024 bytes.

I want to do this as fast as possible, since I have hundreds of these 1GB files to process. A full extraction takes about 1m30s per file.

I tried using full extraction:

import tarfile

tar = tarfile.open(fn, mode='r|bz2')
for item in tar:
    tar.extract(item)

I also tried TarFile.getmembers(), but with no speed improvement:

tar = tarfile.open(fn, mode='r|bz2')
for member in tar.getmembers():
    f = tar.extractfile(member)
    headerbytes = f.read(1024)
    headerdict = parseHeader(headerbytes)

The getmembers() method is what's taking all the time there.

Is there any way I can do this?


Solution

  • I think you should use the standard library bz2 interface. .tbz is the file extension for tar archives compressed with bzip2, which is what tar's -j option produces.

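    As @bbayles pointed out in the comments, you can open your file as a bz2.BZ2File and use seek and read. Only the data up to the requested position gets decompressed, so the rest of the ~9GB file is never touched: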

    read([size])

    Read at most size uncompressed bytes, returned as a string. If the size argument is negative or omitted, read until EOF is reached.

    seek(offset[, whence])

    Move to new file position. Argument offset is a byte count.

    import bz2

    f = bz2.BZ2File(path)
    f.seek(512)  # skip the 512-byte tar header block that precedes the file's data
    headerbytes = f.read(1024)
    

    You can then parse that with your own function:

    headerdict = parseHeader(headerbytes)
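
To try the seek-and-read idea end to end, here is a minimal sketch that builds a small stand-in archive and reads the start of the file inside it. The helper name read_first_member_prefix, the file names, and the payload are all made up for illustration. It assumes the archive's first entry is the data file itself and that it is preceded by exactly one 512-byte tar header block; an archive that starts with a directory entry or an extended (long-name or pax) header would need a different offset.

```python
import bz2
import io
import os
import tarfile
import tempfile

TAR_BLOCK = 512  # tar stores each member header in one 512-byte block


def read_first_member_prefix(path, size=1024):
    """Return the first `size` bytes of the first file in a bzip2-compressed tar.

    Hypothetical helper: assumes a single plain header block before the data.
    """
    with bz2.BZ2File(path) as f:
        f.seek(TAR_BLOCK)    # decompresses and discards only the tar header block
        return f.read(size)  # decompresses just enough to return `size` bytes


# Build a small stand-in archive so the sketch can be run as-is.
payload = b"HDR0" + bytes(range(256)) * 8  # 2052 bytes of made-up data
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "sample.tbz")
with tarfile.open(archive, mode="w:bz2") as tar:
    info = tarfile.TarInfo(name="big.dat")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

headerbytes = read_first_member_prefix(archive)
```

With a real archive you would call read_first_member_prefix(fn) and pass the result to parseHeader. Because the read stops after the requested bytes, the cost should depend on the header size rather than on the ~9GB of data behind it.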