python unicode decode encode mojibake

python3 decode str to utf8


I have a str variable in Python 3.6 with the following value:

\xc3\xa4\xc2\xb8\xc2\xad\xc3\xa5\xc2\x9b\xc2\xbd\xc3\xa6\xc2\xb0\xc2\x91\xc3\xa7\xc2\x94\xc2\x9f\xc3\xa9\xc2\x93\xc2\xb6\xc3\xa8\xc2\xa1\xc2\x8c

I want to decode the str to Chinese. I tried encoding the str first and then decoding it, but it doesn't work. My code is as follows:

str = '\xE4\xB8\xAD\xE5\x9B\xBD\xE6\xB0\x91\xE7\x94\x9F\xE9\x93\xB6\xE8\xA1\x8C'
str.encode('utf-8').decode('unicode_escape')

The output is as follows:

ä¸Â\xadÃ¥Â\x9b½æ°Â\x91çÂ\x94Â\x9féÂ\x93¶è¡Â\x8c

Solution

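  • This looks like Latin-1 mojibake: UTF-8 encoded text that was incorrectly decoded as Latin-1. Encoding the str back to Latin-1 recovers the original bytes, which can then be decoded as UTF-8.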

    >>> s = '\xE4\xB8\xAD\xE5\x9B\xBD\xE6\xB0\x91\xE7\x94\x9F\xE9\x93\xB6\xE8\xA1\x8C'
    >>> s.encode('latin-1').decode('utf-8')
    '中国民生银行'
    

    I can't read Chinese, but Google Translate says this means "China Minsheng Bank". Does the output make sense?
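
The round trip from the answer above can be wrapped in a small helper. This is only a sketch: `fix_latin1_mojibake` is a hypothetical name, not part of the standard library, and it assumes the damage really was a single UTF-8-read-as-Latin-1 mistake. If the text does not survive the round trip, it is returned unchanged. The last lines also check that the escaped value shown at the top of the question matches the UTF-8 encoding of the already-mangled str, which would suggest the text was encoded twice.

```python
def fix_latin1_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Hypothetical helper for illustration. Returns the input unchanged
    when it cannot be the result of that specific mistake.
    """
    try:
        # Latin-1 maps code points 0-255 to the same byte values,
        # so encoding with it recovers the original raw bytes.
        raw = text.encode('latin-1')
        # Reinterpret those bytes with the encoding they were written in.
        return raw.decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 mojibake (or not valid UTF-8 underneath): leave as is.
        return text


s = '\xE4\xB8\xAD\xE5\x9B\xBD\xE6\xB0\x91\xE7\x94\x9F\xE9\x93\xB6\xE8\xA1\x8C'
fixed = fix_latin1_mojibake(s)
print(fixed)  # 中国民生银行

# Encoding the mangled str as UTF-8 yields the doubled byte sequence
# (\xc3\xa4\xc2\xb8\xc2\xad...) quoted at the top of the question.
double_encoded = s.encode('utf-8')
print(double_encoded[:6])  # b'\xc3\xa4\xc2\xb8\xc2\xad'
```

Text that is already correct, such as plain ASCII or a properly decoded Chinese str, passes through unchanged because one of the two conversion steps raises and the helper falls back to the input.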