
python utf-8 encoding with pandas


i'm having an issue best demonstrated with this webpage https://www.basketball-reference.com/draft/NBA_2018.html which per document.charset is encoded in 'utf-8'. i use the following code

    import pandas
    import requests
    html = requests.get("https://www.basketball-reference.com/draft/NBA_2018.html", headers={"User-Agent": "XY"}).content
    df_list = pandas.read_html(html)

at which point df_list[0] correctly shows the third pick's name as Dončić in the console. okay so far so good, but what i want to do is output this table to a csv file, so i do

    with open('C:/Users/Eric/br2.csv', 'a', encoding='utf-8') as f:
        df_list[0].to_csv(f, header=True, encoding='utf-8')

which prints the name as DonÄić. this also happens if i use the encoding 'utf-8-sig', and the open doesn't work at all if i use the encoding 'latin1' or don't put an encoding on it. if i try simply printing instead of using .to_csv, i still get DonÄić. if i use requests.get().text instead of .content, it ends up being DonÄÂić.
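Garbled output like this usually means correct UTF-8 bytes are being decoded with a different single-byte encoding by whatever displays them. The following is a minimal sketch of that effect, not a diagnosis of this exact file: it assumes the viewer falls back to Latin-1 (Windows tools often use cp1252 instead, which gives slightly different characters).

```python
name = "Dončić"

# Bytes that a correct UTF-8 writer puts on disk.
utf8_bytes = name.encode("utf-8")

# What a viewer shows if it wrongly decodes those bytes as Latin-1
# (an assumed fallback): each two-byte character turns into two
# unrelated characters.
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong decode recovers the original text, so the
# bytes themselves were never damaged.
repaired = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes)
print(ascii(garbled))
print(repaired == name)
```

If the round trip succeeds on the real data, the file is fine and the fix belongs on the reading side, or in giving the reader a hint such as a BOM.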

my question is: i've got the information extracted and properly formatted in python, how do i get it properly formatted in a file?

thanks!

edited to add: thanks to Milos and Mark, i've discovered there's something much weirder going on haha. if i open the file in Excel or Notepad set to utf-8 encoding, they still don't show it correctly, but i can see it correctly in OpenOffice set to utf-8. that's not the weird part. the weird part is that if i copy the string Luka Dončić into Notepad and save it as a new csv, Excel and Notepad both DO show it correctly when opening with utf-8 encoding. as best i can tell, something about the pandas to_csv function is just making a weird csv, which seems impossible but there's no way around it lol. anyway, chitown88's solution is much easier, so to anyone searching later, i recommend that!
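One possible difference between a file saved from Notepad and one written from Python is the byte order mark (BOM) at the start of the file. The sketch below shows one way to check for it. The helper name has_utf8_bom and the file names are hypothetical, and whether a missing BOM explains the behavior described above is an assumption, not something the question confirms.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


tmp_dir = tempfile.mkdtemp()
fresh = os.path.join(tmp_dir, "fresh.csv")        # hypothetical file names
appended = os.path.join(tmp_dir, "appended.csv")

# New file written with utf-8-sig: the BOM is the first thing on disk.
with open(fresh, "w", encoding="utf-8-sig") as f:
    f.write("Player\nLuka Dončić\n")

# A file that already has content and no BOM, then appended to with
# utf-8-sig: appending cannot put a BOM at the start of the file.
with open(appended, "w", encoding="utf-8") as f:
    f.write("Player\n")
with open(appended, "a", encoding="utf-8-sig") as f:
    f.write("Luka Dončić\n")

print(has_utf8_bom(fresh))
print(has_utf8_bom(appended))
```

Because the code in the question opens the file with mode 'a', a file created by an earlier run without a BOM would stay without one even after switching to 'utf-8-sig'. Deleting the old file or writing with mode 'w' avoids that.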


Solution

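  • Using 'utf-8-sig' worked fine. It writes a UTF-8 byte order mark (BOM) at the start of the file, which Excel uses to detect that the file is UTF-8. Passing the file name straight to to_csv also lets pandas create the file, so the BOM lands at the very beginning.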

    import pandas
    
    df = pandas.read_html("https://www.basketball-reference.com/draft/NBA_2018.html", header=1)[0]
    df.to_csv('output.csv', encoding='utf-8-sig')
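To see what 'utf-8-sig' changes at the byte level, here is a small standard-library sketch. It only illustrates the codec itself and does not touch pandas or the page above.

```python
import codecs

text = "Luka Dončić"

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# utf-8-sig output is the BOM followed by the ordinary UTF-8 bytes.
print(with_bom == codecs.BOM_UTF8 + plain)

# Decoding with utf-8-sig strips the BOM; plain utf-8 keeps it as U+FEFF.
print(with_bom.decode("utf-8-sig") == text)
print(with_bom.decode("utf-8") == "\ufeff" + text)
```

The text is encoded identically either way. The only difference is the three marker bytes at the front, which some Windows programs rely on to pick UTF-8 when opening a file.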