I'm trying to write a pandas DataFrame containing Unicode to JSON, but the built-in .to_json
function escapes the non-ASCII characters. How do I fix this?
Example:
import pandas as pd
df = pd.DataFrame([["τ", "a", 1], ["π", "b", 2]])
df.to_json("df.json")
This gives:
{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}
Which differs from the desired result:
{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}
I have tried adding the force_ascii=False argument:
import pandas as pd
df = pd.DataFrame([["τ", "a", 1], ["π", "b", 2]])
df.to_json("df.json", force_ascii=False)
But this gives the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u03c4' in position 11: character maps to <undefined>
This occurs on pandas versions 0.18 to 2.2+ and on Python 3.4 to 3.12+.
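The 'charmap' in the message hints at the cause: the file is probably being opened with a legacy single-byte code page and not UTF-8. As a minimal sketch, assuming the platform default is something like cp1252 (an assumption, common on Windows setups), the same failure can be reproduced without pandas:

```python
# Assumption: the platform default encoding is a code page like cp1252.
# cp1252 has no mapping for the Greek letter tau, so encoding fails.
try:
    "τ".encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError as exc:
    cp1252_ok = False
    error_text = str(exc)  # mentions the 'charmap' codec, as in the traceback

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = "τ".encode("utf-8")
```

If this is what is happening, any fix has to make the output encoding explicit.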
Opening a file with the encoding set to utf-8 and then passing that file to the .to_json function fixes the problem:
with open('df.json', 'w', encoding='utf-8') as file:
    df.to_json(file, force_ascii=False)
This gives the correct output:
{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}
Note: the force_ascii=False argument is still required.
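An alternative that should behave the same way, sketched here as one possible approach: call .to_json with no path so it returns the JSON as a string, then write that string to a file opened as utf-8. The temporary output path is hypothetical and only there to keep the example self-contained.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame([["τ", "a", 1], ["π", "b", 2]])

# With no path argument, to_json returns the JSON text as a str.
json_text = df.to_json(force_ascii=False)

# Write the text ourselves so the encoding is explicit
# and does not depend on the platform default.
out_path = os.path.join(tempfile.mkdtemp(), "df.json")  # hypothetical location
with open(out_path, "w", encoding="utf-8") as file:
    file.write(json_text)

# Read the file back to check that the characters survived the round trip.
with open(out_path, encoding="utf-8") as file:
    round_trip = json.load(file)
```

This keeps serialization and file handling separate, which can make it easier to see which step an encoding error comes from.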