I want to read this webpage:
http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html
If I use pd.read_html, the content usually loads properly, but recently I have started getting an HTTP Error 400: Bad Request.
So I tried to use:
import pandas as pd
import requests
link = 'http://www.stats.gov.cn/tjsj/zxfb/202210/t20221014_1889255.html'
header = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(link, headers=header)
df = pd.read_html(r.text, encoding='utf-8')[1]
which gets over the 400 error, but the Chinese characters aren't readable, as the screenshot shows.
Why does this encoding problem occur when I go through requests but not when I call pd.read_html directly, and how can I solve it? Thanks.
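I think I've solved it. Use r.content (the raw bytes) rather than r.text (a string that requests has already decoded):
df = pd.read_html(r.content, encoding='utf-8')[1]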
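A likely explanation for why r.text comes out garbled, though I have not checked this server's actual headers: when a text response carries no charset in its Content-Type header, requests falls back to ISO-8859-1 to build r.text, so UTF-8 bytes get decoded with the wrong codec. The sketch below reproduces that effect with the standard library only. The HTML snippet is a made-up stand-in for the response body, not the real page.

```python
# Hypothetical stand-in for r.content: the raw, UTF-8 encoded response body.
raw = "<td>国家统计局</td>".encode("utf-8")

# Roughly what r.text would hold if requests fell back to ISO-8859-1
# (assumption: the server sent no charset in its Content-Type header).
wrong = raw.decode("iso-8859-1")

# What you get when the bytes are decoded as UTF-8, which is what passing
# r.content together with encoding='utf-8' lets the parser do.
right = raw.decode("utf-8")

# The garbling is reversible: re-encode with the wrong codec, decode with the right one.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(repr(wrong))
print(right)
print(repaired == right)
```

If this is what is happening, r.content sidesteps the guess because pd.read_html receives the bytes and the decoding is done with the encoding you asked for. Setting r.encoding = 'utf-8' before reading r.text should have the same effect.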