
Python: How to handle a corrupted gzip file in reading multiple files


I am reading a large set of gzip files. When I run the code below, the process cannot finish because some of the files are corrupted. Python can open the corrupted files, but processing is interrupted by errors raised while reading certain lines in those files.

    for file in files:
        try:
            fin=gzip.open(file,'rb')
        except:
            continue
        
        for line in fin:
            try:
                temp=line.decode().split(",")
                a,b,c,d=temp[0],int(temp[1]),int(temp[2]),int(temp[3])
            except:
                continue

But the program stops because of the following error. What is the best way to process a corrupted gzip file?

Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 374, in readline
    return self._buffer.readline(size)
  File "/opt/anaconda3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'rv')

I have modified the code as shown below, and it seems to run well, but I am not sure whether this is the best way to handle such cases, because in certain cases the program does not seem to terminate (I need to test more).

    for file in files:
        try:
            fin=gzip.open(file,'rb')
        except:
            continue
        
        line=True
        while line:
            try:
                line=fin.readline()
            except:
                continue
            try:
                temp=line.decode().split(",")
                a,b,c,d=temp[0],int(temp[1]),int(temp[2]),int(temp[3])
            except:
                continue

Solution

  • Your iteration over the file (for line in fin) is outside of any try...except, so an exception raised while reading will terminate the program. If you put a single try...except around the whole thing, it should work:

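        import gzip
        import zlib

        for file in files:
            try:
                with gzip.open(file, 'rb') as fin:
                    for line in fin:
                        temp = line.decode().split(",")
                        a, b, c, d = temp[0], int(temp[1]), int(temp[2]), int(temp[3])
            # OSError, EOFError, zlib.error: damaged or truncated gzip stream;
            # ValueError, IndexError: a line that does not parse as "a,b,c,d".
            except (OSError, EOFError, zlib.error, ValueError, IndexError):
                continue  # give up on this file and move on to the next one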
    

    Note also:

    • Catch only the specific exceptions you expect from a bad file, not things that should still terminate the program (e.g. KeyboardInterrupt). A bare except: is usually a bad idea.
    • It is better to open the file in a with block (with gzip.open(...) as fin), so that it is closed even when an error occurs.
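
One limitation of wrapping the whole loop in a single try...except is that one malformed line abandons the rest of that file, whereas the code in the question skipped bad lines individually. The sketch below is one possible way to keep both behaviours: parse errors are handled per line, and decompression errors end the current file. It assumes each record has the form a,b,c,d as in the question; the names read_records, parse_line and STREAM_ERRORS are made up for illustration. Besides OSError, it also catches EOFError and zlib.error, which the gzip module can raise for truncated or damaged data. Stopping at the first stream error, rather than retrying the read as the while loop in the question does, also avoids looping forever on a file whose reads keep failing.

```python
import gzip
import zlib

# Hypothetical helper names; not part of the original answer.
# Errors the gzip reader can raise partway through a damaged stream.
STREAM_ERRORS = (OSError, EOFError, zlib.error)


def parse_line(raw):
    """Parse one b'a,b,c,d' record; raises ValueError or IndexError on bad input."""
    fields = raw.decode().split(",")
    return fields[0], int(fields[1]), int(fields[2]), int(fields[3])


def read_records(path):
    """Return (records, intact) for one gzip file.

    Malformed lines are skipped. A decompression error ends the file early
    and sets intact to False, keeping the records read up to that point.
    """
    records = []
    try:
        with gzip.open(path, "rb") as fin:
            for line in fin:  # iteration itself may raise on a damaged stream
                try:
                    records.append(parse_line(line))
                except (ValueError, IndexError):
                    continue  # bad record, but the stream is still readable
    except STREAM_ERRORS:
        return records, False  # stop here instead of retrying the read
    return records, True
```

Inside the loop from the question this could be called as records, intact = read_records(file), logging the file name whenever intact is False so that damaged files can be inspected later.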