I want to decompress a huge gz file (the Wikidata JSON dump latest-all.json.gz, 104 GB compressed) on the fly in Python with gzip.open.
It works fine for a while, but after reading 39.7 million lines it fails with:
zlib.error: Error -3 while decompressing data: too many length or distance symbols
The function where I do the decompressing and reading looks like this:
import gzip
import json
...

def wikidata(filename):
    with gzip.open(filename, mode='rt') as f:
        f.read(2)  # skip the first two characters: "{\n"
        for line in f:
            try:
                yield json.loads(line.rstrip(',\n'))
            except json.decoder.JSONDecodeError:
                continue
The error in full is:
Traceback (most recent call last):
  File "parse.py", line 95, in <module>
    for line in lines:
  File "parse.py", line 21, in wikidata
    for line in f:
  File "/usr/lib/python3.8/gzip.py", line 305, in read1
    return self._buffer.read1(size)
  File "/usr/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.8/gzip.py", line 487, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: too many length or distance symbols
What can be the reason for this? How can I solve the problem?
Error -3 from zlib means the deflate stream itself is invalid: the compressed data is corrupted at that point, or a short distance before it. There is nothing to recover there; the only way to solve the problem is to replace the input with a gzip file that is not corrupted, e.g. by re-downloading the dump and, if a checksum is published for it, verifying the download against that checksum.