I have JSON data with accented characters in it and got some unexpected results: sometimes characters are replaced with �, as if the wrong encoding were used, but not every time.
I'm using Python 3.11.0. All my files are encoded using UTF-8.
Below are a reproduction of the problem and the scenarios I tried in order to understand the logic (or bug).
The problem:
import json

# Content of file1.json:
# {"a": "à"}
# Hex dump: 7B 22 61 22 3A 20 22 C3 A0 22 7D
with open("file1.json", "r") as file:
    text = file.read()
print(text) # {"a": "à"}
print(json.loads(text)) # {'a': '�\xa0'}
But this works:
# Content of file2.json:
# "à"
# Hex dump: 22 C3 A0 22
with open("file2.json", "r") as file:
    text = file.read()
print(text) # "à"
print(json.loads(text)) # à
I also tried opening the file with encoding="utf-8" (which I believe is the default?), and this doesn't work:
with open("file2.json", "r", encoding="utf-8") as file:
    text = file.read()
print(text) # "�"
print(json.loads(text)) # �
Then I tried with plain string literals, and that confused me even further:
text = '"à"'
print(text) # "�"
print(json.loads(text)) # �
text = '{"a": "à"}'
print(text) # {"a": "�"}
print(json.loads(text)) # {'a': '�'}
OK, I found where the problem was: the VS Code output was not using UTF-8.
I don't know why the output doesn't use UTF-8 when the terminal does, but I switched to the terminal.
Now the results are what I expected.
Note that the encoding parameter is needed, because without it open() uses the platform's locale encoding, which is not necessarily UTF-8:
with open("file1.json", "r", encoding="utf-8") as file:
    text = file.read()
print(text) # {"a": "à"}
print(json.loads(text)) # {'a': 'à'}
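
For anyone hitting the same thing, here is a minimal sketch of what I think was going on. It assumes the locale encoding on my machine was cp1252 (a common Windows default; I have not confirmed this), and it rebuilds the file content from the hex dump above instead of reading the file. ascii() is used for printing so that the result does not depend on how the console or output panel decodes bytes.

```python
import json
import locale
import sys

# Bytes from the hex dump of file1.json: {"a": "à"} encoded as UTF-8.
raw = bytes.fromhex("7B 22 61 22 3A 20 22 C3 A0 22 7D")

# Decoding with the correct codec gives the expected value.
good = json.loads(raw.decode("utf-8"))

# Decoding with cp1252 (assumed locale encoding, see above) turns the two
# bytes C3 A0 into two characters: 'Ã' and a no-break space.
bad = json.loads(raw.decode("cp1252"))

# ascii() escapes every non-ASCII character, so this output is unambiguous.
print(ascii(good))  # {'a': '\xe0'}
print(ascii(bad))   # {'a': '\xc3\xa0'}

# The encoding open() falls back to when none is given, and the encoding
# print() writes with. If either is not UTF-8, accents can get mangled.
print(locale.getpreferredencoding(False))
print(getattr(sys.stdout, "encoding", None))
```

If the last two lines print something other than UTF-8, that would explain why the same text looks fine in one place and broken in another: the data can be decoded correctly and still be displayed incorrectly, or the other way around.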