Search code examples
pythonunicodedecodecp1252

how to convert u'\uf04a' to unicode in python


I am trying to decode u'\uf04a' in python thus I can print it without error warnings. In other words, I need to convert stupid microsoft Windows 1252 characters to actual unicode

The source of html containing the unusual errors comes from here http://members.lovingfromadistance.com/showthread.php?12338-HAVING-SECOND-THOUGHTS

Read about u'\uf04a' and u'\uf04c' by clicking here http://www.fileformat.info/info/unicode/char/f04a/index.htm

one example looks like this:

"Oh god please some advice ":

Out[408]: u'Oh god please some advice \uf04c'

Given a thread like this as one example for test:

thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread.decode('utf8')

print u'\uf04a'
print u'\uf04a'.decode('utf8') # error!!!

'charmap' codec can't encode character u'\uf04a' in position 1526: character maps to undefined

With the help of two Python scripts, I successfully convert the u'\x92', but I am still stuck with u'\uf04a'. Any suggestions?

References

https://github.com/AnthonyBRoberts/NNS/blob/master/tools/killgremlins.py

Handling non-standard American English Characters and Symbols in a CSV, using Python

Solution:

According to the comments below: I replace these character set with the question mark('?')

thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread = thread.replace(u'\uf04a', '?')
thread = thread.replace(u'\uf04c', '?')

Hope this helpful to the other beginners.


Solution

  • u'\uf04a'
    

    already is a Unicode object, which means there's nothing to decode. The only thing you can do with it is encode it, if you're targeting a specific file encoding like UTF-8 (which is not the same as Unicode, but is confused with it all the time).

    u'\uf04a'.encode("utf-8")
    

    gives you a string (Python 2) or bytes object (Python 3) which you can then write to a file or a UTF-8 terminal etc.

    You won't be able to encode it as a plain Windows string because cp1252 doesn't have that character.

    What you can do is convert it to an encoding that doesn't have those offending characters by telling the encoder to replace missing characters by ?:

    >>> u'who\uf04a why\uf04c'.encode("ascii", errors="replace")
    'who? why?'