Tags: python, gzip, zlib, common-crawl

Streaming in a gzipped file from S3 in Python


Hi, I'm working on a project for fun with the Common Crawl data. I have a subset of the most recent crawl's WARC file paths from here.

So basically, I have a URL like https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/warc/CC-MAIN-20201019145901-20201019175901-00000.warc.gz (the first URL in the WARC paths), and I am streaming the response like so:

import zlib
import requests

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv

s = requests.Session()
resp = s.get(url, headers=headers, stream=True)  # url and headers defined earlier
print(resp.status_code)
for line in stream_gzip_decompress(resp):
    print(line.decode('utf-8'))

(stream_gzip_decompress comes from Python unzipping stream of bytes?)

The first three chunks seem to decompress fine and print out, then the script just hangs forever (I only waited about 8 minutes). It seems to still be running through the chunks, but rv comes back empty, so the if rv: check never yields anything, even though bytes are still streaming in.
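
The likely cause: a Common Crawl .warc.gz is a multi-member gzip file (each WARC record is compressed as its own member), and a single zlib decompressobj stops emitting output once the first member ends, which matches the "rv is empty" symptom above. A minimal sketch of a member-aware variant, assuming the same streaming setup (the stream_gzip_decompress_multi name is just for illustration):

import zlib

def stream_gzip_decompress_multi(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
        # when a gzip member ends mid-chunk, start a fresh decompressor
        # on the leftover bytes instead of silently discarding them
        while dec.eof and dec.unused_data:
            leftover = dec.unused_data
            dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
            rv = dec.decompress(leftover)
            if rv:
                yield rv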


Solution

  • Why not use a WARC parser library (I'd recommend warcio) to do the parsing, including the gzip decompression? A minimal example is sketched below.

    Alternatively, have a look at gzipstream to read from a stream of gzipped content and decompress the data on the fly.
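
    For instance, a minimal warcio sketch against the URL from the question (assuming warcio is installed; resp.raw hands ArchiveIterator the raw, still-compressed byte stream, and warcio takes care of the per-record gzip members itself):

import requests
from warcio.archiveiterator import ArchiveIterator

url = ('https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/'
       'segments/1603107863364.0/warc/'
       'CC-MAIN-20201019145901-20201019175901-00000.warc.gz')

# stream=True keeps requests from buffering the whole file in memory
resp = requests.get(url, stream=True)
for record in ArchiveIterator(resp.raw):
    if record.rec_type == 'response':
        # print each captured page's URL; record.content_stream().read()
        # would give the decompressed payload instead
        print(record.rec_headers.get_header('WARC-Target-URI'))

This sidesteps the multi-member issue entirely, since warcio re-synchronizes on each record's gzip member as it iterates.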