Tags: json, python-3.x, character-encoding

Weird behavior with accents, strings and json


I have JSON data containing accented characters and got some unexpected results: sometimes the characters are replaced with �, as if the wrong encoding were used, but not every time.

I'm using Python 3.11.0. All my files are encoded in UTF-8.

Below are reproductions of the problem and the scenarios I tried in order to understand the logic (or bug).

The problem:

# Content of file1.json:
# {"a": "à"}
# Hex dump: 7B 22 61 22 3A 20 22 C3 A0 22 7D

import json

with open("file1.json", "r") as file:
    text = file.read()

print(text) # {"a": "à"}
print(json.loads(text)) # {'a': '�\xa0'}

But this works:

# Content of file2.json:
# "à"
# Hex dump: 22 C3 A0 22

with open("file2.json", "r") as file:
    text = file.read()

print(text) # "à"
print(json.loads(text)) # à

I also tried opening the file explicitly as UTF-8 (which I believed was the default), and this doesn't work either:

with open("file2.json", "r", encoding="utf-8") as file:
    text = file.read()

print(text) # "�"
print(json.loads(text)) # �
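
On the parenthetical above: as far as I know, in Python 3.11 `open()` in text mode does not default to UTF-8 unless UTF-8 mode is enabled. It falls back to the locale encoding, which on Windows is often a legacy code page such as cp1252. A minimal diagnostic sketch, using only standard-library calls, to see what a given environment actually uses (the variable names are just illustrative):

```python
import locale
import sys

# Encoding that open() uses in text mode when no encoding= is passed
# (the locale encoding, unless Python's UTF-8 mode is enabled).
default_file_encoding = locale.getpreferredencoding(False)

# Encoding that print() uses when writing to standard output.
# This can differ between a terminal and an editor's output window.
stdout_encoding = sys.stdout.encoding

print("open() default encoding:", default_file_encoding)
print("stdout encoding:", stdout_encoding)
```

If the two values differ from what the viewer expects, the file may be read correctly and still be displayed wrongly, or the other way round.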

Then I tried with plain string literals, and that confused me even further:

text = '"à"'
print(text) # "�"
print(json.loads(text)) # �
text = '{"a": "à"}'
print(text) # {"a": "�"}
print(json.loads(text)) # {'a': '�'}

Solution

  • OK, I found the problem: VS Code's Output panel was not using UTF-8.

    I don't know why the Output panel doesn't use UTF-8 when the terminal does, but I switched to running the script in the terminal.

    Now the results are what I expected. Note that the encoding parameter is still needed:

    with open("file1.json", "r", encoding="utf-8") as file:
        text = file.read()

    print(text) # {"a": "à"}
    print(json.loads(text)) # {'a': 'à'}
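
For anyone wondering why the symptoms looked so inconsistent, here is a sketch that reproduces them from the hex dump in the question. It rests on two assumptions that the post does not confirm: that both the default file encoding and Python's stdout encoding were cp1252, and that the output window decoded the resulting bytes as UTF-8. The helper `as_seen_in_panel` is hypothetical and only simulates that encode/decode round trip.

```python
import json

# Bytes of file1.json, copied from the hex dump in the question.
raw = bytes.fromhex("7B 22 61 22 3A 20 22 C3 A0 22 7D")


def as_seen_in_panel(s, stdout_encoding="cp1252"):
    """Simulate print(): encode with the (assumed) stdout encoding,
    then decode the bytes as UTF-8 the way the viewer is assumed to."""
    return s.encode(stdout_encoding).decode("utf-8", errors="replace")


# Case 1: open() without encoding= (assumed to fall back to cp1252).
misread = raw.decode("cp1252")       # C3 A0 becomes two characters: 'Ã' and U+00A0
shown_text_1 = as_seen_in_panel(misread)
# The two wrong characters encode back to C3 A0, so the text looks correct.
shown_dict_1 = as_seen_in_panel(repr(json.loads(misread)))
# repr() escapes U+00A0 as the literal text \xa0, leaving a lone C3 byte,
# which is invalid UTF-8 and is displayed as U+FFFD.

# Case 2: open(..., encoding="utf-8"), which decodes the file correctly.
correct = raw.decode("utf-8")        # contains a real 'à'
shown_text_2 = as_seen_in_panel(correct)
# 'à' is written as the single byte E0, which is invalid UTF-8 on its own.
shown_dict_2 = as_seen_in_panel(repr(json.loads(correct)))
```

Under those assumptions, `shown_text_1` looks right even though the file was decoded wrongly, while `shown_text_2` looks broken even though the data in memory is correct. That matches what the question reports, and it would explain why switching to a terminal that displays UTF-8 made the correctly decoded version appear as expected.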