Tags: python, python-3.x, decode, urllib, urlopen

urllib.request.urlopen returns bytes, but I cannot decode it


I tried parsing a web page using the urlopen() function from urllib.request, like this:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returned the result in bytes.

So I tried decoding it, like:

html = urlopen(req).read().decode("utf-8")

However, the error occurred:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

After some research, I found a related answer that reads the charset from the response to decide how to decode it. However, this page doesn't return a charset, and when I inspected it in Chrome Web Inspector, the following line appeared in its HTML head:

<meta charset="utf-8">

So why can I not decode it with utf-8? And how can I parse the web page successfully?

The website URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the image from it to my disk.

Note that I use Python 3.5.1. Also, everything I wrote above has worked well in my other scraping programs.


Solution

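  • The content is compressed with gzip. The byte 0x8b at position 1 in the error is the second byte of the gzip magic number (0x1f 0x8b), which is why decoding the raw bytes as UTF-8 fails. You need to decompress it first: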

    import gzip
    from urllib.request import Request, urlopen
    
    req = Request(url)
    # Decompress the gzip payload first, then decode the resulting bytes.
    html = gzip.decompress(urlopen(req).read()).decode('utf-8')
    

    If you use the requests library, it decompresses the response automatically for you:

    import requests
    html = requests.get(url).text  # => str, not bytes
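
If you stay with urllib, calling gzip.decompress() unconditionally will fail on pages that are not compressed. A minimal sketch of a more defensive variant follows. It assumes a hypothetical helper named decode_body (not part of urllib) and treats the body as gzip when either the Content-Encoding header says so or the data starts with the gzip magic number. The demonstration uses locally compressed data, so it runs without network access.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the body as str, decompressing gzip data when detected.

    decode_body is a hypothetical helper written for this example.
    """
    header_says_gzip = (content_encoding or "").lower() == "gzip"
    if header_says_gzip or raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Offline demonstration with locally compressed data.
page = '<meta charset="utf-8">'
compressed = gzip.compress(page.encode("utf-8"))

print(compressed[:2])                      # the gzip magic number
print(decode_body(compressed))             # decompressed, then decoded
print(decode_body(page.encode("utf-8")))   # plain bytes pass through

# With a real response, the header could be passed along like this
# (assuming req is built as in the question):
#   resp = urlopen(req)
#   html = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

The charset defaults to UTF-8 here only because the page in question declares it in its meta tag; other pages may need a different value.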