The crux here is that this is a huge file. My goal is to avoid reading the entire file into memory at once, AND to avoid parsing every line in a loop to reach the line I need (that takes forever; the file is literally 15 million lines long).
What I'm currently doing is opening the file as...
self._FH = gzip.open(filename, "rb")
...moving the pointer directly to the location of the needed line (using many shenanigans, but it works) and reading in the individual line.
The lines are similar to those below (these examples come from the beginning of the file, for ease of illustration)...
b'BAM\x01\x17\x18\x00\x00@HD\tVN:1.0\tSO:coordinate\n'
b'@SQ\tSN:1\tLN:248956422\n'
b'@SQ\tSN:10\tLN:133797422\n'
b'@SQ\tSN:11\tLN:135086622\n'
b'@SQ\tSN:12\tLN:133275309\n'
b'@SQ\tSN:13\tLN:114364328\n'
b'@SQ\tSN:14\tLN:107043718\n'
b'@SQ\tSN:15\tLN:101991189\n'
b'@SQ\tSN:16\tLN:90338345\n'
b'@SQ\tSN:17\tLN:83257441\n'
b'@SQ\tSN:18\tLN:80373285\n'
Some might notice this is a BAM file, so if there's a better way to do this, suggestions are welcome ...although the samtools filters won't accomplish what I need. I have to seek by line, not by data.
A simple approach would be to take advantage of the fact that a concatenation of valid gzip streams is itself a valid gzip stream. Then when compressing, you can compress chunks of lines into individual gzip streams, noting the starting byte offset of each stream in the file and the line number of the first line compressed in that stream. Then you can just jump to that offset and start decompressing from there. If your chunks are on the order of a megabyte (around 50,000 lines), there should be relatively little reduction in the compression ratio. On average you would then need to decompress 25,000 lines to get to any given line, instead of 7.5 million.
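A minimal sketch of that idea in Python, assuming you control the compression step. The function names, the index layout (a list of `(first_line, byte_offset)` pairs), and the chunk size are my own choices for illustration; Python's `gzip` module transparently reads concatenated gzip members, which is what makes the seek-then-decompress step work:

```python
import gzip

def write_chunked_gzip(lines, out_path, chunk_lines=50_000):
    """Compress `lines` (a list of bytes) as concatenated gzip streams.

    Returns an index: a list of (first_line_number, byte_offset) pairs,
    one per stream, recorded as the streams are written.
    """
    index = []
    with open(out_path, "wb") as out:
        for start in range(0, len(lines), chunk_lines):
            index.append((start, out.tell()))   # line no. and stream start
            chunk = b"".join(lines[start:start + chunk_lines])
            out.write(gzip.compress(chunk))     # one self-contained stream
    return index

def read_line(path, index, lineno):
    """Fetch a single line by jumping to the stream that contains it."""
    # Last index entry whose first line is <= the requested line.
    first, offset = max(e for e in index if e[0] <= lineno)
    with open(path, "rb") as fh:
        fh.seek(offset)
        # gzip.open on the seeked handle decompresses from this stream
        # onward (concatenated members are read transparently); we stop
        # as soon as the wanted line appears.
        with gzip.open(fh, "rb") as gz:
            for i, line in enumerate(gz, start=first):
                if i == lineno:
                    return line
```

In the worst case this decompresses one full chunk to reach a line, which is the 25,000-line average mentioned above for 50,000-line chunks; you can trade index size against seek cost by shrinking `chunk_lines`.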
If you are not in control of the creation of the gzip file, and can't recreate it to your needs, then you can index an existing gzip file using the approach used in zran.c. You can specify how close you want your access points to be, and it will build an index that allows decompression to start at each of those points. You would also need to build an index of your line starts (as you would for an uncompressed file), to associate those with byte offsets into the uncompressed data.
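The zran.c access points themselves require saved inflate state and are best left to zran.c (or a wrapper around it), but the line-start index mentioned above can be built in one streaming pass with the standard library. A sketch, with the function name, sampling interval, and index layout being my own assumptions: it records the uncompressed byte offset of every `every`-th line start, handling concatenated gzip members (as produced by chunked writers or BGZF) by restarting the decompressor on leftover input:

```python
import zlib

def build_line_index(gz_path, every=50_000):
    """One streaming pass over a gzip file, recording the uncompressed
    byte offset of every `every`-th line start.

    Entry k of the returned list is the uncompressed offset of line
    k * every. Pair these offsets with zran-style access points to
    jump close to any given line without decompressing from the start.
    """
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # gzip wrapper
    offsets = [0]   # line 0 starts at uncompressed offset 0
    lineno = 0
    pos = 0         # uncompressed bytes fully scanned so far
    with open(gz_path, "rb") as fh:
        while True:
            raw = fh.read(1 << 16)
            if not raw:
                break
            data = d.decompress(raw)
            # A member may end mid-chunk: restart on the leftover bytes
            # so concatenated gzip members are scanned seamlessly.
            while d.eof and d.unused_data:
                leftover = d.unused_data
                d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
                data += d.decompress(leftover)
            start = 0
            while True:
                nl = data.find(b"\n", start)
                if nl == -1:
                    break
                lineno += 1
                if lineno % every == 0:
                    offsets.append(pos + nl + 1)
                start = nl + 1
            pos += len(data)
    return offsets
```

To fetch line N you would then look up `offsets[N // every]`, use the access-point index to resume decompression just before that uncompressed offset, and skip at most `every - 1` lines.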