python text-manipulation

Decoding unknown encoded Traditional Chinese character strings using Python


Hi, I have a website in Traditional Chinese, and when I check the site statistics it tells me that the search term for the website is å%8f°å%8d%97 è¦ªå­%90é¤%90å»³, which obviously makes no sense to me. My question is: what is this encoding called? And is there a way to use Python to decode this character string? Thank you.


Solution

  • It has no proper name; call it a mutt encoding (a kind of mojibake). The underlying bytes have been mangled beyond their original meaning, so they no longer form a real encoding.

    The text was originally URL-quoted UTF-8, but it has since been decoded as Latin-1 without first unquoting the URL escapes. I was able to un-mangle it by reversing those steps: encode back to Latin-1 bytes, unquote the escapes, then decode as UTF-8:

    >>> from urllib2 import unquote
    >>> bytesquoted = u'å%8f°å%8d%97 è¦ªå­%90é¤%90å»³'.encode('latin1')
    >>> unquoted = unquote(bytesquoted)
    >>> print unquoted.decode('utf8')
    台南 親子餐廳
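
The session above is Python 2 (`urllib2`, the `print` statement). For Python 3, the same three steps might look like the sketch below. The names `unmangle` and `MANGLED` are made up for illustration, and the soft hyphen hidden inside the search term is written as `\xad` on the assumption that an invisible character is easy to lose when copying the string around.

```python
from urllib.parse import unquote_to_bytes

# The mangled search term from the question. "\xad" is a soft hyphen,
# invisible in most fonts, so it is spelled out as an escape here.
MANGLED = "å%8f°å%8d%97 è¦ªå\xad%90é¤%90å»³"


def unmangle(text: str) -> str:
    """Undo 'UTF-8 bytes shown as Latin-1, with some bytes left as %XX'."""
    # Step 1: turn each Latin-1 character back into the single byte it came from.
    latin1_bytes = text.encode("latin-1")
    # Step 2: replace the leftover %XX URL escapes with their raw bytes.
    utf8_bytes = unquote_to_bytes(latin1_bytes)
    # Step 3: the result is the original UTF-8 byte sequence.
    return utf8_bytes.decode("utf-8")


print(unmangle(MANGLED))  # 台南 親子餐廳
```

This only works for input that was mangled in exactly this way. Text containing characters outside Latin-1 raises `UnicodeEncodeError` in step 1, and bytes that are not valid UTF-8 raise `UnicodeDecodeError` in step 3, so it may be worth catching both if the function is run over a whole log of search terms.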