How to get the internal position while reading a bzip2 file


I've got a script that decompresses and parses data contained in a bunch of very large bzip2-compressed files. Since it can take a while, I'd like to have some way to monitor the progress. I know I can get the file size with os.path.getsize(), but bz2.BZ2File.tell() returns the position within the uncompressed data. Is there any way to get the current position within the compressed file so I can monitor the progress?

Bonus points if there's a Python equivalent to Java's ProgressMonitorInputStream.


Solution

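  • This is the solution I came up with that seems to work. It reads the raw file itself in chunks and feeds them to a bz2.BZ2Decompressor, so tell() reports the position within the compressed file.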

    import bz2
    
    class SimpleBZ2File(object):
        """Iterate over the lines of a bzip2 file; tell() gives the compressed offset."""
    
        def __init__(self, path, readsize=1024):
            self.decomp = bz2.BZ2Decompressor()
            self.rawinput = open(path, 'rb')
            self.eof = False
            self.readsize = readsize
            self.leftover = b''  # partial line carried over to the next chunk
    
        def tell(self):
            # Offset in the compressed file, comparable to os.path.getsize(path)
            return self.rawinput.tell()
    
        def __iter__(self):
            # Lines are yielded as bytes on Python 3
            while not self.eof:
                rawdata = self.rawinput.read(self.readsize)
                if not rawdata:
                    self.eof = True
                else:
                    data = self.decomp.decompress(rawdata)
                    if not data:
                        continue  # we need to supply more raw data to decompress
                    # Join the carried-over fragment first, so a chunk holding
                    # only part of a line is never yielded early or twice
                    newlines = (self.leftover + data).splitlines(True)
                    self.leftover = b''
                    if not newlines[-1].endswith(b'\n'):
                        self.leftover = newlines.pop()
                    for l in newlines:
                        yield l
            if self.leftover:
                yield self.leftover
            self.rawinput.close()
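
For what it's worth, on Python 3.3 or later something similar may be possible without a custom class, because bz2.BZ2File also accepts an already-open binary file object in place of a path. The sketch below is only an illustration under that assumption: the function name iter_lines_with_progress is made up for this example, and the reported fraction is approximate, since the decompressor reads the raw file ahead in buffered chunks.

```python
import bz2
import os


def iter_lines_with_progress(path):
    """Yield (line, fraction_done) pairs for a bzip2 file.

    fraction_done is the share of compressed bytes read so far, so it can be
    compared directly against os.path.getsize(path). Hypothetical helper,
    not part of the bz2 module.
    """
    total = os.path.getsize(path)
    with open(path, 'rb') as raw:
        # BZ2File wraps the open file object, so raw.tell() keeps reporting
        # the offset within the compressed file while we iterate.
        with bz2.BZ2File(raw) as stream:
            for line in stream:
                # Guard against a zero-byte file to avoid dividing by zero.
                fraction = raw.tell() / total if total else 1.0
                yield line, fraction
```

A loop such as "for line, done in iter_lines_with_progress(path)" could then update a progress display every so often using done * 100 as a percentage. That is not a full equivalent of ProgressMonitorInputStream, only the position tracking part of it.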