Tags: python, pandas, character-encoding

Chinese character encoding - pd.read_html vs requests


I want to read this webpage:

http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html

With pd.read_html the content usually loads properly, but recently I have started getting an HTTP Error 400: Bad Request.

So I tried to use:

import pandas as pd
import requests

link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)
df = pd.read_html(r.text, encoding='utf-8')[1]

which gets past the 400 error, but the Chinese characters come out unreadable, as the screenshot shows.

Why does this encoding problem occur when I fetch the page with requests but not when pd.read_html fetches it directly, and how can I solve it? Thanks

Screenshot


Solution

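  • I think I've solved it: pass r.content rather than r.text to pd.read_html. r.content is the raw bytes, so pd.read_html decodes them as UTF-8 itself, whereas r.text has already been decoded by requests using the encoding it guessed from the response headers, which can be wrong for this page.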
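
To illustrate why r.text and r.content behave differently, here is a minimal offline sketch. It assumes the page is UTF-8 and that requests fell back to ISO-8859-1 (which it does for text/* responses whose Content-Type header names no charset); I have not confirmed that this is what the server actually sends. The sample table and the read_tables helper are hypothetical, not part of the original code.

```python
import io

# Stand-in for r.content: UTF-8 bytes of a tiny HTML table (made-up sample data).
html_bytes = (
    "<table><tr><th>指标</th></tr><tr><td>居民消费价格</td></tr></table>"
).encode("utf-8")

# Roughly what r.text holds if requests assumed ISO-8859-1 (assumption, see above).
wrong_text = html_bytes.decode("iso-8859-1")

# What the bytes mean when decoded with the correct encoding.
right_text = html_bytes.decode("utf-8")

print("指标" in wrong_text)  # False: the characters were mis-decoded
print("指标" in right_text)  # True

# ISO-8859-1 maps every byte to one character, so the mistake can be undone.
repaired = wrong_text.encode("iso-8859-1").decode("utf-8")


def read_tables(url):
    """Hypothetical helper: fetch a page and parse its tables from raw bytes."""
    # Imported here so the offline demo above runs without these packages.
    import pandas as pd
    import requests

    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    r.raise_for_status()
    # Hand pandas the undecoded bytes and let it apply the encoding itself.
    return pd.read_html(io.BytesIO(r.content), encoding="utf-8")
```

With the link from the question, read_tables(link)[1] should correspond to the df above. An alternative that keeps r.text is to set r.encoding = 'utf-8' (or r.encoding = r.apparent_encoding) before reading r.text, so that requests decodes the body with the right encoding in the first place.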