
UTF-8 decoding doesn't decode special characters in python


Hi, I have the following data (abstracted) that comes from an API:

"Product" : "T\u00e1bua 21X40"

I'm using the following code to decode the response bytes:

var = json.loads(cleanhtml(str(json.dumps(response.content.decode('utf-8')))))

cleanhtml is a regex function that I've created to remove HTML tags from the returned data (it's working correctly). However, decode('utf-8') is not converting sequences like \u00e1. My expected output is:

"Product" : "Tábua 21X40"

I've tried to use replace("\\u00e1", "á") but with no success. How can I replace this type of character and what type of character is this?


Solution

  • \u00e1 is another way of representing the á character when displaying the contents of a Python string.

    If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}), you'll see the output {'Product': 'Tábua 21X40'}. The string does not contain the six individual characters \, u, 0, 0, e and 1; it contains the single character á.

    The \u escape sequence indicates that the following four hexadecimal digits specify a Unicode code point.

    Attempting to replace \u00e1 with á won't achieve anything because that's what it already is. Additionally, replace("\\u00e1", "á") searches for the individual characters of a backslash, a u, and so on, and, as mentioned, they don't actually exist in the string in that form.

    If you explain the problem you're encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.
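
The points above can be checked with a short sketch. It compares a string written with the \u00e1 escape against one written with á directly, shows what replace("\\u00e1", "á") actually searches for, and parses a small JSON body. The payload variable is a hypothetical stand-in for the API response, not the asker's real data.

```python
import json

# In Python source, "\u00e1" is an escape sequence for one character, á.
escaped = "T\u00e1bua 21X40"
literal = "Tábua 21X40"
same_string = escaped == literal   # both spellings build the same str
length = len(escaped)              # á counts as a single character

# A raw string keeps the backslash, u, 0, 0, e, 1 as six separate characters.
raw_text = r"T\u00e1bua 21X40"

# replace("\\u00e1", "á") looks for those six literal characters, so it
# changes raw_text but leaves the already-decoded string untouched.
replaced_raw = raw_text.replace("\\u00e1", "á")
replaced_escaped = escaped.replace("\\u00e1", "á")

# JSON text uses the same \uXXXX notation, and json.loads decodes it.
# Hypothetical payload standing in for the API response body.
payload = b'{"Product": "T\\u00e1bua 21X40"}'
data = json.loads(payload.decode("utf-8"))

print(same_string, length, len(raw_text))
print(replaced_raw == literal, replaced_escaped == escaped)
print(data["Product"] == literal)
```

If the value still shows a literal \u00e1 after parsing, one possible cause (an assumption, since the full code isn't shown) is that the text was passed through json.dumps before json.loads, which hands back the original JSON text as a string instead of a parsed dict.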