Tags: python, gzip, zlib, common-crawl

Python's zlib doesn't work on CommonCrawl file


I am trying to decompress a file with Python's zlib, and it doesn't seem to work. The file is about 100 MB, comes from Common Crawl, and I saved it as wet.gz. When I decompress it in the terminal with gunzip, everything works fine. Here are the first few lines of the output:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin ([email protected])
description: Wide crawl of the web for August 2022
publisher: Common Crawl



WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://100bravert.main.jp/public_html/wiki/index.php?cmd=backup&action=nowdiff&page=Game_log%2F%EF%BC%A7%EF%BC%AD%E6%9F%98&age=53
WARC-Date: 2022-08-07T15:32:56Z
WARC-Record-ID: <urn:uuid:8dd329bf-6717-4d0c-ae05-93445c59fd50>
WARC-Refers-To: <urn:uuid:1e2e972b-4273-468a-953f-28b0e45fb117>
WARC-Block-Digest: sha1:GTEJAN2GXLWBXDRNUEI3LLEHDIPJDPTU
WARC-Identified-Content-Language: jpn
Content-Type: text/plain
Content-Length: 12482

Game_log/GM柘 のバックアップの現在との差分(No.53) - PukiWiki
Game_log/GM柘 のバックアップの現在との差分(No.53)
[ トップ ] [ 新規 | 一覧 | 単語検索 | 最終更新 | ヘルプ ]
バックアップ一覧

However, when I try to use Python's gzip or zlib library, using these code examples:

import gzip
import zlib

# using gzip
with gzip.open('wet.gz', 'rb') as fh:
    data = fh.read()

# using zlib (MAX_WBITS | 16 selects the gzip container format)
o = zlib.decompressobj(zlib.MAX_WBITS | 16)
with open('wet.gz', 'rb') as f:
    result = [o.decompress(f.read()), o.flush()]

Both of them return this:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-08-20T09:26:35Z
WARC-Filename: CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
WARC-Record-ID: <urn:uuid:3f9035e8-8038-4239-a566-c9410b93956d>
Content-Type: application/warc-fields
Content-Length: 371

Software-Info: ia-web-commons.1.1.10-SNAPSHOT-20220804021208
Extracted-Date: Sat, 20 Aug 2022 09:26:35 GMT
robots: checked via crawler-commons 1.4-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
isPartOf: CC-MAIN-2022-33
operator: Common Crawl Admin ([email protected])
description: Wide crawl of the web for August 2022
publisher: Common Crawl





So apparently, they can decompress the first few paragraphs just fine, but all other paragraphs below it are lost. Is this a bug in Python's zlib/gzip library?

Edit for future readers: I've integrated the accepted answer into my Python package, k1lib, in case you'd rather not handle this yourself:

pip install k1lib
from k1lib.imports import *
lines = cat("wet.gz", text=False, chunks=True) | unzip(text=True)
for line in lines:
    print(line)

This reads the file in binary mode chunk by chunk, decompresses the chunks incrementally, splits the result into lines, and converts each line to a string.


Solution

  • Your wet.gz consists of 31,849 concatenated gzip members. Per the gzip standard (RFC 1952), a concatenation of valid gzip streams is itself a valid gzip stream.

    Python's zlib.decompressobj() does not automatically continue on to read and decompress the gzip members after the first one. Yes, I would consider this a bug, since it does not comply with the gzip standard, though failing to comply in this way is common.
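To see the behavior described above in isolation, here is a minimal sketch that builds a two-member stream in memory instead of using wet.gz. The sample payloads are made up for illustration; the only calls used are gzip.compress, gzip.decompress, and zlib.decompressobj.

```python
import gzip
import zlib

# Two independent gzip members, concatenated into one stream (made-up sample data).
stream = gzip.compress(b"first member\n") + gzip.compress(b"second member\n")

d = zlib.decompressobj(zlib.MAX_WBITS | 16)
first = d.decompress(stream)

print(first)                   # only the first member's payload
print(d.eof)                   # True once the end of the first member is reached
print(len(d.unused_data) > 0)  # the second member is left untouched in unused_data

# gzip.decompress, by contrast, walks through every member.
everything = gzip.decompress(stream)
print(everything)
```

The second member is not lost; it is simply handed back in unused_data for the caller to deal with.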

    The workaround is simple. Put the Python decompression in a loop that keeps going until the input is consumed. The attribute o.unused_data holds the input left over after the end of the member just decompressed; feed it to a fresh decompression object to start on the next member.

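    import zlib

    data = left = b''
    with open("wet.gz", "rb") as f:
        o = zlib.decompressobj(zlib.MAX_WBITS + 16)
        while True:
            got = f.read(32768)
            # feed the leftover from the previous member plus the new chunk
            data += o.decompress(left + got)
            left = b''
            if o.eof:
                # end of a member: keep what follows it and start a new decompressor
                left = o.unused_data
                o = zlib.decompressobj(zlib.MAX_WBITS + 16)
            # stop once the file is exhausted and nothing is left over
            if len(got) == 0 and len(left) == 0:
                break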
    

    (That also avoids loading the entire input into memory. For illustration, it accumulates the entire output in memory, but if possible that data should be processed as it arrives instead.)
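The parenthetical above suggests processing the data as it arrives. One possible way to do that, sketched here under the assumption that a generator fits your use case, is to yield each decompressed block instead of appending it to one large bytes object. The function name iter_gzip_members and the chunk_size parameter are hypothetical names chosen for this example.

```python
import zlib


def iter_gzip_members(fileobj, chunk_size=32768):
    """Yield decompressed blocks from a (possibly multi-member) gzip stream."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 16)
    pending = b""  # input left over after the end of the previous member
    while True:
        data = pending + fileobj.read(chunk_size)
        pending = b""
        if not data:
            break  # file exhausted and nothing left over
        out = d.decompress(data)
        if out:
            yield out
        if d.eof:
            # End of a member: keep the remaining input and start a new decompressor.
            pending = d.unused_data
            d = zlib.decompressobj(zlib.MAX_WBITS | 16)
```

To use it, open wet.gz in binary mode, pass the file object to iter_gzip_members, and handle each yielded block inside a for loop. Memory use then stays roughly bounded by the chunk size and the size of one decompressed block, rather than the whole output.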

    Python's gzip module (gzip.open(...).read()) works for me on wet.gz and decompresses the whole thing. Perhaps you have an older version of Python.
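If the gzip module does work on your Python version, reading line by line avoids holding the whole file in memory. The sketch below is only an illustration: it writes a small two-member file to a temporary directory as a stand-in for wet.gz (the file name sample.wet.gz and its contents are made up), then iterates over it with gzip.open in text mode.

```python
import gzip
import os
import tempfile

# Build a small multi-member file as a stand-in for wet.gz (made-up contents).
path = os.path.join(tempfile.mkdtemp(), "sample.wet.gz")
with open(path, "wb") as f:
    f.write(gzip.compress(b"WARC/1.0\nWARC-Type: warcinfo\n"))
    f.write(gzip.compress(b"WARC/1.0\nWARC-Type: conversion\n"))

# gzip.open in text mode reads lines lazily and continues across member boundaries.
with gzip.open(path, "rt", encoding="utf-8") as fh:
    lines = [line.rstrip("\n") for line in fh]

print(lines)
```

If the second record is missing from the output on your machine, that would point at the Python version rather than at the file.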