Hi, I'm working on a fun project with the Common Crawl data. I have a subset of the most recent crawl's WARC file paths from here.
So basically I have a URL like https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/warc/CC-MAIN-20201019145901-20201019175901-00000.warc.gz (the first URL in the WARC paths file), and I am streaming in the request like so:
import requests
import zlib

s = requests.Session()
resp = s.get(url, headers=headers, stream=True)  # url and headers as described above
print(resp.status_code)

for line in stream_gzip_decompress(resp):
    print(line.decode('utf-8'))
def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
(stream_gzip_decompress is taken from Python unzipping stream of bytes?.)
The first three chunks seem to decompress fine and print out, but then the script just hangs (I only waited about 8 minutes). It seems to still be running through the chunks, but rv is empty at the if rv: check, so nothing gets yielded, even though bytes still seem to be streaming in.
Why not use a WARC parser library (I'd recommend warcio) to do the parsing, including the gzip decompression?
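For example, warcio's ArchiveIterator can read records straight from a streaming requests response and takes care of the per-record gzip members used in .warc.gz files. A minimal sketch (the filter on rec_type == 'response' is just for illustration):

import requests
from warcio.archiveiterator import ArchiveIterator

url = ('https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/'
       'segments/1603107863364.0/warc/CC-MAIN-20201019145901-20201019175901-00000.warc.gz')

resp = requests.get(url, stream=True)
# resp.raw is the undecoded byte stream; ArchiveIterator handles the gzip decompression
for record in ArchiveIterator(resp.raw):
    if record.rec_type == 'response':
        print(record.rec_headers.get_header('WARC-Target-URI'))
        # record.content_stream().read() would give you the HTTP payload bytes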
Alternatively, have a look at gzipstream to read from a stream of gzipped content and decompress the data on the fly.
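I haven't used gzipstream recently, so treat the class name and exact call below as assumptions on my part; the general shape, as I understand it, is to wrap the raw response stream and read decompressed bytes from the wrapper:

import requests
from gzipstream import GzipStreamFile  # assumed import path and class name

resp = requests.get(url, stream=True)   # url as in the question
decompressed = GzipStreamFile(resp.raw)  # wraps the raw (still gzipped) byte stream

# read decompressed data in chunks as it arrives
while True:
    chunk = decompressed.read(65536)
    if not chunk:
        break
    # process chunk here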