I have ~1GB *.tbz files. Each of them contains a single ~9GB file. I only need to read the header of that file, i.e. its first 1024 bytes.
I want to do this as fast as possible, as I have hundreds of these 1GB files to process. A full extraction takes about 1m30s per file.
I tried a full extraction:
import tarfile
tar = tarfile.open(fn, mode='r|bz2')
for item in tar:
    tar.extract(item)
I also tried getmembers(), but with no speed improvement:
tar = tarfile.open(fn, mode='r|bz2')
for member in tar.getmembers():
    f = tar.extractfile(member)
    headerbytes = f.read(1024)
    headerdict = parseHeader(headerbytes)
The getmembers() call is what takes all the time there.
Is there any way I can do this?
I think you should use the standard library bz2 interface. .tbz is the file extension for tar files compressed with bzip2 (what tar's -j option produces).
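To make the layout concrete, here is a minimal sketch that decompresses only the first 512-byte block of the archive and parses it as a tar header. The function name `first_member_info` is hypothetical, and the sketch assumes the archive starts with a plain ustar header for the file itself (no pax or GNU long-name record in front of it), which may not hold for every archive.

```python
import bz2
import tarfile

TAR_BLOCK = 512  # tar stores each member's metadata in one 512-byte header block


def first_member_info(path):
    """Return (name, size) of the first tar member without reading its data."""
    # bz2 decompresses lazily, so only the start of the archive is processed.
    with bz2.open(path, "rb") as f:
        block = f.read(TAR_BLOCK)
    # Assumption: this block is the member's own header, not an extended header.
    info = tarfile.TarInfo.frombuf(block, tarfile.ENCODING, "surrogateescape")
    return info.name, info.size
```

The member's data begins right after this block, which is why an offset of 512 points at the first byte of the inner file.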
As @bbayles pointed out in the comments, you can open your file as a bz2.BZ2File and use seek and read:
read([size])
Read at most size uncompressed bytes, returned as a string. If the size argument is negative or omitted, read until EOF is reached.
seek(offset[, whence])
Move to new file position. Argument offset is a byte count.
import bz2
with bz2.BZ2File(path) as f:
    f.seek(512)  # skip the 512-byte tar header block that precedes the file data
    headerbytes = f.read(1024)
You can then parse that with your functions.
headerdict = parseHeader(headerbytes)
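If you would rather not hard-code the 512-byte offset, a possible alternative (my assumption, not something from the comments) is to stay with tarfile in stream mode but ask only for the first member instead of calling getmembers(), which has to scan the whole archive. The function name `read_first_member_head` is hypothetical; this is a sketch, not a benchmarked solution.

```python
import tarfile


def read_first_member_head(path, nbytes=1024):
    """Return the first nbytes of the first regular file in a .tbz archive."""
    # 'r|bz2' opens a forward-only stream, so data is decompressed on demand.
    with tarfile.open(path, mode="r|bz2") as tar:
        member = tar.next()
        # Skip anything that is not a regular file (e.g. a directory entry).
        while member is not None and not member.isfile():
            member = tar.next()
        if member is None:
            raise ValueError("no regular file found in archive")
        return tar.extractfile(member).read(nbytes)
```

Usage would look like headerdict = parseHeader(read_first_member_head(fn)), with parseHeader being your own function from the question.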