Tags: encoding, utf-8, web-scraping, readfile, python-3.6

Unable to open an HTML file containing Chinese characters


Hi everyone, I ran into trouble when trying to open an HTML file containing Chinese characters. Here is the code:

# problem with Chinese characters
import wget

file = wget.download("http://nba.stats.qq.com/player/list.htm#teamId=1")
with open(file, encoding='utf-8') as f:
    html = f.read()
    print(html)

However, I get the following error:

    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 535: invalid continuation byte

I searched for a while and found some similar issues, but the solutions seem to use latin-1, which is obviously not the right encoding here. I'm not sure which encoding to use.

Any suggestions? Thanks.


Solution

  • The page you are referring to is encoded in GBK, not UTF-8. You can tell by looking at the header:

    <meta charset="GBK">
    

    If you pass encoding='gbk' to open() instead of encoding='utf-8', it'll work.

    On another note, I would avoid wget unless you have to use it, and instead go with urllib, which comes with the Python standard library. It also saves the disk write, and the code is simpler:

    import urllib.request
    
    with urllib.request.urlopen("http://nba.stats.qq.com/player/list.htm") as file:
        html = file.read()
        print(html.decode('gbk'))
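
The "look at the header" step can also be done in code. Here is a rough sketch, assuming the page declares its charset in a <meta> tag near the top of the document. The helper name sniff_meta_charset and the sample bytes are made up for illustration (they stand in for the downloaded page), so treat this as one possible approach rather than a robust detector.

```python
import re


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in an HTML <meta> tag, or `default` if none is found.

    Hypothetical helper for illustration only.
    """
    # Charset declarations normally sit near the top, so inspect only the
    # first 2048 bytes and drop anything that is not plain ASCII.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(
        r"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", head, re.IGNORECASE
    )
    return match.group(1).lower() if match else default


# Stand-in for the downloaded page: GBK-encoded bytes with a GBK <meta> tag.
sample = '<html><head><meta charset="GBK"></head><body>球员</body></html>'.encode("gbk")

# Decoding these bytes as UTF-8 fails, just like in the question.
try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the declared charset succeeds.
encoding = sniff_meta_charset(sample)
html = sample.decode(encoding)
print(encoding, utf8_ok)
print(html)
```

With the real page, the bytes returned by file.read() in the urllib example would take the place of sample. If a page has no <meta> charset at all, the helper falls back to the default, which may still be wrong for that page.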