I tried parsing a web page using urllib.request's urlopen() method, like:
from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()
However, the last line returned the result in bytes.
So I tried decoding it, like:
html = urlopen(req).read().decode("utf-8")
However, this raised an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.
After some research, I found a related answer that parses the charset from the response to decide how to decode it. However, this page doesn't return the charset, and when I checked it in Chrome Web Inspector, the following line was written in the page's head:
<meta charset="utf-8">
So why can't I decode it with utf-8? And how can I parse the web page successfully?
The web site URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, from which I want to save the image to my disk.
Note that I use Python 3.5.1. Also, the same approach has worked well in my other scraping programs.
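The content is compressed with gzip: the byte 0x8b at position 1 in the error is the second byte of the gzip magic number (0x1f 0x8b). You need to decompress it before decoding: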
import gzip
from urllib.request import Request, urlopen
req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
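The snippet above assumes the body is always gzipped, so it would fail on a server that sends an uncompressed response. A minimal sketch of a more defensive variant follows; decode_body is a hypothetical helper name (not part of urllib), and it assumes the page really is UTF-8 once decompressed:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Return the body as str, decompressing gzip first when needed.

    Hypothetical helper for illustration. `content_encoding` is the value
    of the Content-Encoding response header, if the caller has it.
    """
    declared_gzip = (content_encoding or "").lower() == "gzip"
    if declared_gzip or raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Possible usage with urlopen (not executed here, needs network access):
# resp = urlopen(req)
# html = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Checking the magic bytes as well as the header is a design choice for this sketch: valid UTF-8 text cannot begin with 0x1f 0x8b, so the check should not misfire on an uncompressed UTF-8 page.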
If you use requests, it will decompress the content automatically for you:
import requests
html = requests.get(url).text # => str, not bytes