
Python - Issues with Unicode String from API Call


I'm using Python to call an API that returns the last name of some soccer players. One of the players has a "ć" in his name.

When I call the endpoint, the name prints with a Unicode escape sequence instead of the character:

>>> last_name = (json.dumps(response["response"][2]["player"]["lastname"]))

>>> print(last_name)

"Mitrovi\u0107"

>>> print(type(last_name))

<class 'str'>

If I copy and paste that output into a string literal of its own, like so:

>>> print("Mitrovi\u0107")

Mitrović

>>> print(type("Mitrovi\u0107"))

<class 'str'>

Then it prints just fine.

What is wrong with the API endpoint call and the string that comes from it?


Solution

  • Well, you serialise the string with json.dumps() before printing it, which is why you get a different output. Compare the following:

    >>> print("Mitrović")
    Mitrović
    

    and

    >>> print(json.dumps("Mitrović"))
    "Mitrovi\u0107"
    

    The second command adds double quotes to the output and escapes non-ASCII characters, because that is how json.dumps() encodes strings by default. So response["response"][2]["player"]["lastname"] most likely already contains exactly what you want, and the escaped output only appears because you wrapped it in json.dumps() before printing.

    Note: don't confuse Python string literals with the JSON serialisation of strings. They share some common features, but they aren't the same (e.g. JSON strings can't be single-quoted), and they serve different purposes (the first is for writing strings in source code, the second is for encoding data to send across the network).

    Another note: You can avoid most of the escaping with ensure_ascii=False in the json.dumps() call:

    >>> print(json.dumps("Mitrović", ensure_ascii=False))
    "Mitrović"
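
To tie the points above together, here is a minimal sketch as a script rather than a REPL session. The `response` dictionary is a hypothetical stand-in for the parsed API result; its shape is only assumed from the indexing used in the question. It prints the value directly, prints both json.dumps() variants, and checks that decoding either JSON text gives back the original string.

```python
import json

# Hypothetical stand-in for the parsed API response; the real structure
# is only assumed from the indexing shown in the question.
response = {"response": [{}, {}, {"player": {"lastname": "Mitrović"}}]}

# The value is already an ordinary Python str; no json.dumps() is needed to print it.
last_name = response["response"][2]["player"]["lastname"]

escaped = json.dumps(last_name)                        # JSON text, ASCII-only by default
unescaped = json.dumps(last_name, ensure_ascii=False)  # JSON text, keeps the ć as-is

print(last_name)
print(escaped)
print(unescaped)

# Both JSON texts decode back to the same Python string.
assert json.loads(escaped) == last_name
assert json.loads(unescaped) == last_name
```

In other words, json.dumps() is only needed when the value has to be sent or stored as JSON text; for display, print the string itself.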