Possible Duplicate:
How can I tail a zipped file without reading its entire contents?
I have a 7GB gzip syslog file that extracts to over 25GB. I need to retrieve only the first and last lines of the file without reading the whole file into memory at once.
GzipFile() in Python 2.7 supports the with statement, which makes reading the head easy (iterating over the file object means I don't have to read the whole file):
>>> from itertools import islice
>>> from gzip import GzipFile
>>> with GzipFile('firewall.4.gz') as file:
...     head = list(islice(file, 1))
...
>>> head
['Oct 2 07:35:14 192.0.2.1 %ASA-6-305011: Built dynamic TCP translation from INSIDE:192.0.2.40/51807 to OUTSIDE:10.18.61.38/2985\n']
A Python 2.6 version, to avoid errors such as AttributeError: GzipFile instance has no attribute '__exit__' (GzipFile() doesn't support the with statement in 2.6):
>>> from itertools import islice
>>> from gzip import GzipFile
>>> class GzipFileHack(GzipFile):
...     def __enter__(self):
...         return self
...     def __exit__(self, type, value, tb):
...         self.close()
...
>>> with GzipFileHack('firewall.4.gz') as file:
...     head = list(islice(file, 1))
...
The problem is that I have no way to retrieve the tail: islice() doesn't support negative values, and I can't find a way to get the last line without iterating through the full 25GB of decompressed data (which takes far too long).
What is the most efficient way to read the tail of a gzip text file without reading the whole file into memory or iterating over all the lines? If this can't be done, please explain why.
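There is no way to do so. DEFLATE is a stream compression algorithm: each block can contain back-references into the preceding 32 KB of decompressed output, and a plain gzip file carries no index of block boundaries. This means there is no way to decompress an arbitrary part of the file without first decompressing everything before it. You don't have to hold the whole file in memory, but you do have to read through all of it to reach the last line.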
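Since the whole stream has to be decompressed anyway, the practical goal is to keep memory use constant while doing it. Below is a minimal sketch of one way to do that, not the only one. The helper name head_and_tail and its parameter n are made up for illustration, and it assumes Python 2.7 or later, where the object returned by gzip.open() works with the with statement. Lines come back as raw byte strings.

```python
import gzip
from collections import deque


def head_and_tail(path, n=1):
    """Return (first n lines, last n lines) of a gzip file in one pass.

    Hypothetical helper for illustration. It still decompresses the
    whole file, but never holds more than about 2*n lines in memory.
    """
    head = []
    # A deque with maxlen silently drops the oldest entry when full,
    # so after the loop it holds only the final n lines.
    tail = deque(maxlen=n)
    with gzip.open(path, 'rb') as f:
        for line in f:
            if len(head) < n:
                head.append(line)
            tail.append(line)
    return head, list(tail)
```

Calling head_and_tail('firewall.4.gz') would return the first and last lines as two one-element lists. The run time is still proportional to the size of the compressed file, so this fixes the memory concern but not the speed concern.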