Hi everyone, I ran into trouble when trying to open an HTML file containing Chinese characters. Here is the code:
# problem with Chinese characters
import wget

file = wget.download("http://nba.stats.qq.com/player/list.htm#teamId=1")
with open(file, encoding='utf-8') as f:
    html = f.read()
print(html)
However, I get the following error:
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 535: invalid continuation byte
I searched for a while and saw some similar issues, but the solutions seem to use latin-1, which is obviously not the case here. I'm not sure which encoding to use.
Any suggestions? Thanks!
The page you are referring to is not encoded in UTF-8 but in GBK. You can tell by looking at the meta tag in the page's head:
<meta charset="GBK">
If you specify encoding='gbk' in the open() call, it'll work.
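To see the difference without touching the network, here is a minimal sketch. The `sample` string, `read_text` and `demo` are made-up names for illustration, and the temporary file only stands in for the file that wget saved. It writes GBK bytes to disk, then shows that reading them as UTF-8 raises the same kind of error while reading them as GBK succeeds:

```python
import os
import tempfile

# Hypothetical stand-in for the downloaded page (not the real content).
sample = '<meta charset="GBK"><p>中文</p>'


def read_text(path, encoding):
    """Read a whole text file using the given encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


def demo():
    """Write GBK bytes to a temp file, then try both encodings."""
    fd, path = tempfile.mkstemp(suffix=".htm")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(sample.encode("gbk"))
        try:
            read_text(path, "utf-8")
            utf8_failed = False
        except UnicodeDecodeError:
            # GBK byte pairs are usually not valid UTF-8 sequences.
            utf8_failed = True
        return utf8_failed, read_text(path, "gbk")
    finally:
        os.remove(path)


if __name__ == "__main__":
    failed, text = demo()
    print("utf-8 failed:", failed)
    print(text)
```

In your own script the only change needed is the `encoding` argument passed to `open()`.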
On another note, I would opt for not using wget unless you have to, and instead go with urllib, which comes with the Python Standard Library. It also saves the disk write, and the code is simpler:
import urllib.request

with urllib.request.urlopen("http://nba.stats.qq.com/player/list.htm") as file:
    html = file.read()
print(html.decode('gbk'))
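If you would rather not hard-code 'gbk', one option is to read the charset from the response instead. The following is only a sketch under a few assumptions: `sniff_charset`, `fetch_text` and the `META_CHARSET` pattern are hypothetical helpers, the regex is a rough heuristic and not a real HTML parser, and the server or the page may still declare a charset that is wrong or that Python does not recognize.

```python
import re
import urllib.request

# Rough heuristic: find charset=... inside a <meta> tag in the raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?\s*([A-Za-z0-9_-]+)', re.IGNORECASE
)


def sniff_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    match = META_CHARSET.search(raw[:2048])  # the tag normally sits near the top
    return match.group(1).decode("ascii").lower() if match else default


def fetch_text(url):
    """Download `url` and decode it with the declared charset, if there is one."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Charset from the Content-Type HTTP header; None when it is absent.
        charset = response.headers.get_content_charset()
    return raw.decode(charset or sniff_charset(raw))


if __name__ == "__main__":
    print(fetch_text("http://nba.stats.qq.com/player/list.htm"))
```

The HTTP header is tried first and the meta tag is used only as a fallback; if both are missing, the sketch falls back to UTF-8, which may or may not be right for a given page.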