I'm trying to read a table from a URL with pandas, but it returns strange values in the Character column:
# Third-party dependencies: pip install pandas lxml
import pandas as pd
url = "https://docs.google.com/document/d/e/2PACX-1vRMx5YQlZNa3ra8dYYxmv-QIQ3YJe8tbI3kqcuC7lQiZm-CSEznKfN_HYNSpoXcZIV3Y_O3YoUB1ecq/pub"
c = pd.read_html(url)
print(c)
Output:
[              0          1             2
0  x-coordinate  Character  y-coordinate
1             0          â             0
2             0          â             1
3             0          â             2
4             1          â             1
5             1          â             2
6             2          â             1
7             2          â             2
8             3          â             2]
When I print the specific characters I get this:
>>> c[0][1][1]
'â\x96\x88'
At first I assumed this was the hex code of the character, but when I checked, it wasn't. I'm not sure what the significance of the â character is.
You can pass the encoding parameter to read_html() so that the special character is decoded correctly.
The value 'â\x96\x88' is the three UTF-8 bytes E2 96 88 (the block character U+2588) read as Latin-1, so the page is UTF-8 and should be decoded as UTF-8:
url = "https://docs.google.com/document/d/e/2PACX-1vRMx5YQlZNa3ra8dYYxmv-QIQ3YJe8tbI3kqcuC7lQiZm-CSEznKfN_HYNSpoXcZIV3Y_O3YoUB1ecq/pub"
c = pd.read_html(url, encoding='utf-8')
Passing encoding='latin1' is unlikely to help here, since Latin-1 is the decoding that produces the â characters in the first place.
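To see where the â comes from, here is a small sketch using only the standard library. It takes the string from the question, turns it back into the raw bytes, and decodes those bytes as UTF-8. The variable names are just for illustration, and it assumes the value printed in the question is exactly 'â\x96\x88'.

```python
import unicodedata

# The value from the question: UTF-8 bytes that were decoded as Latin-1.
garbled = 'â\x96\x88'

# Latin-1 maps each character back to exactly one byte, so this undoes
# the wrong decoding and recovers the original bytes.
raw = garbled.encode('latin1')

# Decoding those bytes as UTF-8 gives the character the page really contains.
fixed = raw.decode('utf-8')

print(raw)                       # b'\xe2\x96\x88'
print(hex(ord(fixed)))           # 0x2588
print(unicodedata.name(fixed))   # FULL BLOCK
```

So the three characters are one block character (U+2588) split into its three UTF-8 bytes, which suggests encoding='utf-8' is the option to use with read_html().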