Python3 urlopen read weirdness (gzip)


I'm fetching a URL from Schema.org. Its Content-Type is "text/html".

Sometimes, read() works as expected and returns b'<!DOCTYPE html> ...'

Sometimes, read() returns something else: b'\x1f\x8b\x08\x00\x00\x00\x00 ...'

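from urllib.error import URLError
from urllib.request import urlopen

# This snippet runs inside a function, hence the bare return.
try:
    with urlopen("http://schema.org/docs/releases.html") as f:
        txt = f.read()  # bytes: sometimes HTML, sometimes gzip data
except URLError:
    return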

I've tried solving this with txt = f.read().decode("utf-8").encode(), but this sometimes raises an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The obvious workaround is to test whether the first byte is 0x1f (the start of the gzip magic number) and handle the data accordingly.

My question is: Is this a bug or something else?

Edit: See the related question. Apparently, I'm sometimes getting a gzipped stream.

In the end I solved this by adding the following code, as proposed here:

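from zlib import decompress, MAX_WBITS

if 31 == txt[0]:  # 31 == 0x1f, the first byte of the gzip magic number
    txt = decompress(txt, 16+MAX_WBITS)  # 16+MAX_WBITS: expect a gzip header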
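The check can be tried offline, without touching the network. The following is a minimal sketch, assuming a helper named maybe_gunzip (a hypothetical name, not from the question). It tests for the full two-byte gzip magic number rather than only the first byte, and passes 16 + MAX_WBITS so that zlib expects a gzip header and trailer.

```python
import gzip
from zlib import MAX_WBITS, decompress

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def maybe_gunzip(data: bytes) -> bytes:
    """Return data decompressed if it looks like gzip, else unchanged.

    maybe_gunzip is a hypothetical helper name used for illustration.
    """
    if data[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return decompress(data, 16 + MAX_WBITS)
    return data


html = b"<!DOCTYPE html><html></html>"
compressed = gzip.compress(html)

print(compressed[:2])                     # the gzip magic number
print(maybe_gunzip(compressed) == html)   # gzipped input is decompressed
print(maybe_gunzip(html) == html)         # plain input passes through
```

Sniffing the body like this works as a fallback, but the Content-Encoding response header is the more reliable signal when it is available.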

The question remains: why does this return plain text/html sometimes and gzipped data at other times?


Solution

  • There are other questions in this category, but I cannot find an answer that addresses the actual cause of the problem.

    Python's urllib.request.urlopen() (urllib2.urlopen() in Python 2) cannot transparently handle compression. By default it also does not set the Accept-Encoding request header. In addition, the way the HTTP standard interprets this situation has changed over time.

    As per RFC 2616:

    If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if "identity" is one of the available content-codings, then the server SHOULD use the "identity" content-coding, unless it has additional information that a different content-coding is meaningful to the client.

    Unfortunately for this use case, RFC 7231 changes this to:

    If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.

    This means that when you perform a request with urlopen(), the server may respond in whatever content coding it chooses, and the response will still be conformant.

    schema.org seems to be hosted by Google, i.e. it is most likely behind a distributed network of frontend load balancers. So the different answers you get might come from load balancers with slightly different configurations.

    Google engineers have in the past advocated for the use of HTTP compression, so this may well be a deliberate decision.

    So the lesson is: when using urlopen(), we need to set the Accept-Encoding header ourselves.
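One way to follow this advice is sketched below. It is a minimal sketch under stated assumptions, not a definitive implementation: fetch and decode_body are hypothetical helper names, and the sketch only handles the gzip, deflate (assumed to be zlib-wrapped) and identity codings. It sends an explicit Accept-Encoding header and then decodes the body according to the Content-Encoding response header instead of guessing from the first byte.

```python
import gzip
import zlib
from typing import Optional
from urllib.request import Request, urlopen


def decode_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the content coding named by the Content-Encoding header.

    decode_body is a hypothetical helper; only a few codings are handled.
    """
    coding = (content_encoding or "identity").strip().lower()
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if coding == "deflate":
        # Assumes the zlib-wrapped form of deflate.
        return zlib.decompress(raw)
    if coding == "identity":
        return raw
    raise ValueError("unsupported content coding: " + coding)


def fetch(url: str, accept_encoding: str = "identity") -> bytes:
    """Fetch url with an explicit Accept-Encoding header (hypothetical helper)."""
    request = Request(url, headers={"Accept-Encoding": accept_encoding})
    with urlopen(request) as response:
        body = response.read()
        return decode_body(body, response.headers.get("Content-Encoding"))
```

Calling fetch("http://schema.org/docs/releases.html") asks for an uncompressed response, while fetch(url, "gzip") opts in to compression. In both cases the body is decoded based on what the server reports, so a server that ignores the request header is still handled.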