python · compression · bz2

How to get the time needed for decompressing large bz2 files?


I need to process large bz2 files (~6 GB) in Python by decompressing them line by line with BZ2File.readline(). The problem is that I want to know how much time processing the whole file will take.

I did a lot of searching and tried to get the actual size of the decompressed file, so that I could track the percentage processed on the fly, and hence the time remaining. The finding is that it seems impossible to know the decompressed size without decompressing the file first (https://stackoverflow.com/a/12647847/7876675).

Besides taking loads of memory, fully decompressing the file first would itself take a lot of time. So, can anybody help me get the remaining processing time on the fly?
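
The line-by-line approach described above can be sketched as follows (a minimal example; the processing step and the `process_lines` name are placeholders):

```python
import bz2

def process_lines(path):
    """Read a bz2 file line by line without decompressing it to disk first."""
    count = 0
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            # process the line here
            count += 1
    return count
```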


Solution

  • You can estimate the time remaining based on the consumption of compressed data, instead of the production of uncompressed data. The result will be about the same if the data is relatively homogeneous. (If it isn't, then neither the input nor the output will give an accurate estimate anyway.)

    You can easily find the size of the compressed file, and use the time spent on the compressed data so far to estimate the time to process the remaining compressed data.
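
    The estimate itself is a simple proportion. As a hypothetical helper (not part of the original answer), given `consumed` and `total` compressed bytes and the `elapsed` seconds so far:

```python
def time_remaining(consumed, total, elapsed):
    """Estimate seconds left, assuming compressed bytes are consumed at a steady rate."""
    return (total / consumed - 1) * elapsed
```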

    Here is a simple example of using a BZ2Decompressor object to operate on the input a chunk at a time, showing the read progress (Python 3, getting the file name from the command line):

    # Decompress a bzip2 file, showing progress based on consumed input.
    
    import sys
    import os
    import bz2
    import time
    
    def proc(data):
        """Decompress and process a piece of the compressed stream."""
        dat = dec.decompress(data)
        got = len(dat)
        if got != 0:    # 0 is common -- waiting for a bzip2 block
            # process dat here
            pass
        return got
    
    # Get the size of the compressed bzip2 file.
    path = sys.argv[1]
    size = os.path.getsize(path)
    
    # Decompress CHUNK bytes at a time.
    CHUNK = 16384
    totin = 0
    totout = 0
    prev = -1
    dec = bz2.BZ2Decompressor()
    start = time.time()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK), b''):
            # feed chunk to decompressor
            got = proc(chunk)
    
            # handle case of concatenated bz2 streams
            if dec.eof:
                rem = dec.unused_data
                dec = bz2.BZ2Decompressor()
                got += proc(rem)
    
            # show progress
            totin += len(chunk)
            totout += got
            if got != 0:    # only if a bzip2 block emitted
                frac = round(1000 * totin / size)
                if frac != prev:
                    left = (size / totin - 1) * (time.time() - start)
                    print(f'\r{frac / 10:.1f}% (~{left:.1f}s left) ', end='', flush=True)
                    prev = frac
    
    # Show the resulting size.
    print(end='\r')
    print(totout, 'uncompressed bytes')
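
    If you'd rather keep the BZ2File.readline() style from the question, a rougher but simpler alternative (a sketch, not part of the original answer; `lines_with_progress` is a hypothetical helper) is to wrap the raw file yourself and poll its position. tell() on the raw file includes BZ2File's internal read-ahead, so the fraction can run slightly ahead of the lines actually yielded, but it is close enough for a progress display:

```python
import bz2
import os
import time

def lines_with_progress(path, report=lambda frac, secs_left: None):
    """Yield lines from a bz2 file, reporting approximate progress.

    Progress is measured on the raw (compressed) file position; since
    BZ2File reads ahead in chunks, the reported fraction slightly
    overshoots the data actually decompressed so far.
    """
    size = os.path.getsize(path)
    start = time.time()
    with open(path, 'rb') as raw, bz2.BZ2File(raw) as f:
        for line in f:
            consumed = raw.tell()
            if consumed:
                elapsed = time.time() - start
                report(consumed / size, (size / consumed - 1) * elapsed)
            yield line
```

    The trade-off versus the chunked-decompressor version above is granularity: this reports progress per line read rather than per compressed chunk, and it doesn't need to handle concatenated streams itself, since BZ2File does that internally.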