python gzip filesize

Reading lines from a gzipped text file in Python and getting the number of compressed bytes read


I have many gzipped text files that I want to decompress and process on the fly, saving disk space and disk read time at the expense of the time spent decompressing.

So I use the gzip module, along with tqdm to track progress.

But how can I find out the size of the original uncompressed file, so that I can set the total (uncompressed) byte count for the progress bar? From what I have found searching the web, this is hard to do with gzip for files larger than 4 gigabytes, which is my case.

Alternatively, I could track the number of compressed bytes read, with the total set to the size of the compressed file.

How can I achieve that?

Here is a code example, with comments that also reflect what I'm trying to achieve.

I am using Python 3.5.

import gzip
import tqdm
import os

size = os.path.getsize('filename.gz')
pbar = tqdm.tqdm(total=size, unit='B', unit_scale=True, unit_divisor=1024)

with gzip.open('filename.gz', 'rt') as file:
    for line in file:
        bytes_uncompressed = len(line.encode('utf-8'))
        # but how can I get compressed bytes read count?
        # bytes_compressed = ...?

        # pbar.update(bytes_compressed)

Solution

  • You should be able to open the underlying file for reading in binary mode, f = open('filename.gz', 'rb'), and then open the gzip file on top of that: g = gzip.GzipFile(fileobj=f). You perform your read operations on g, and to tell how far along you are, you call f.tell() to ask for the position in the compressed file.

    EDIT2: By the way, you can of course also call tell() on the GzipFile instance to see how far along (in bytes read) you are in the uncompressed data.

    EDIT: Now I see that this is only a partial answer to your problem. You'd also need the total. There, I am afraid, you are a bit out of luck, especially for files over 4 GB, as you've noted. gzip keeps the uncompressed size in the last four bytes of the file, so you could jump there, read them, and jump back (GzipFile does not seem to expose this information itself). But since the field is only four bytes, the largest value it can hold is just under 4 GB; for anything larger, the value is truncated to the lower four bytes of the size (that is, the size modulo 2^32). In that case, I am afraid you won't know the total until you get to the end.

    Anyway, the hint above gives you the current position in both the compressed and the uncompressed data; I hope that lets you at least partly achieve what you've set out to do.
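
The approach described above might be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names iter_lines_with_position and read_isize are made up for this example, and the sketch assumes a single-member gzip file holding UTF-8 text. Note that the position reported by tell() on the underlying file runs somewhat ahead of the line just returned, because the decompressor reads the compressed data in blocks, so it is only an approximate progress measure.

```python
import gzip
import io
import os
import struct


def read_isize(path):
    """Return the 4-byte size field stored at the end of a gzip file.

    This is the uncompressed size modulo 2**32, so it is only the true
    size for data under 4 GiB (and, assumed here, a single-member file).
    """
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)          # jump to the last four bytes
        return struct.unpack('<I', f.read(4))[0]  # little-endian uint32


def iter_lines_with_position(path, encoding='utf-8'):
    """Yield (line, position), where position is the current offset in
    the compressed file as reported by tell() on the underlying file.

    Hypothetical helper: the offset advances in read-ahead blocks, so it
    is approximate, but it never decreases and never exceeds the
    compressed file size.
    """
    with open(path, 'rb') as raw, \
            gzip.GzipFile(fileobj=raw) as gz, \
            io.TextIOWrapper(gz, encoding=encoding) as text:
        for line in text:
            yield line, raw.tell()
```

To drive a tqdm progress bar with this, one option is to create the bar with total=os.path.getsize(path) and unit='B', and then call pbar.update(pos - pbar.n) for each (line, pos) pair, so that the bar always shows the compressed position reached so far.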