I am reading a large set of gzip files. When I tried the code below, the process could not finish because some of the files are corrupted. Python can open those corrupted files, but the process is interrupted by errors on certain lines in those files.
for file in files:
    try:
        fin=gzip.open(file,'rb')
    except:
        continue
    for line in fin:
        try:
            temp=line.decode().split(",")
            a,b,c,d=temp[0],int(temp[1]),int(temp[2]),int(temp[3])
        except:
            continue
But the program stops with the following error. What is the best way to process a corrupted gzip file?
Traceback (most recent call last):---------------------------| 9.0% Complete
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 374, in readline
    return self._buffer.readline(size)
  File "/opt/anaconda3/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/opt/anaconda3/lib/python3.7/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'rv')
I have modified the code as shown below, and it seems to run well, but I am not sure whether this is the best way to handle such cases, because in certain cases the program does not seem to terminate (I need to test more).
for file in files:
    try:
        fin=gzip.open(file,'rb')
    except:
        continue
    line=True
    while line:
        try:
            line=fin.readline()
        except:
            continue
        try:
            temp=line.decode().split(",")
            a,b,c,d=temp[0],int(temp[1]),int(temp[2]),int(temp[3])
        except:
            continue
Your iteration over the file is outside of any try...except, so an exception raised there will terminate the program. If you put a single try...except around the whole thing, it should work:
for file in files:
    try:
        with gzip.open(file,'rb') as fin:
            for line in fin:
                temp = line.decode().split(",")
                a,b,c,d = temp[0], int(temp[1]), int(temp[2]), int(temp[3])
    except (OSError, ValueError):
        continue
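Note also:
- The except clause names the specific exceptions to catch instead of catching everything (which would include KeyboardInterrupt). A bare except: is usually a bad idea.
- The with construct with gzip.open ensures the file is closed, even when an exception is raised.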
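If you would rather skip a malformed line and keep reading the same file, while still abandoning a file whose compressed stream is damaged, the two kinds of failure can be caught at different levels. The following is only a sketch under some assumptions: parse_line and read_rows are hypothetical helper names, the four-field layout is taken from the question, and the outer clause also lists EOFError and zlib.error because truncated or damaged gzip data can raise those rather than OSError.

```python
import gzip
import zlib


def parse_line(raw):
    """Parse one comma-separated line into (str, int, int, int)."""
    temp = raw.decode().split(",")
    return temp[0], int(temp[1]), int(temp[2]), int(temp[3])


def read_rows(paths):
    """Return (rows, bad_files) for the given gzip paths (hypothetical helper)."""
    rows, bad_files = [], []
    for path in paths:
        try:
            with gzip.open(path, "rb") as fin:
                for raw in fin:
                    try:
                        rows.append(parse_line(raw))
                    except (ValueError, IndexError):
                        # Malformed line (bad integer, undecodable bytes,
                        # too few fields): skip it and keep reading.
                        continue
        except (OSError, EOFError, zlib.error):
            # Damaged or truncated stream: give up on this file only.
            bad_files.append(path)
    return rows, bad_files
```

Rows appended before the stream error are kept, and bad_files tells you which inputs to inspect afterwards. Unlike the retry loop in the question, a read error here ends the work on that file instead of calling readline again.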