python python-2.7 character-encoding mojibake

How to detect if a string is already utf8-encoded?

I have some strings like this:

u'ThaÃÂ¯lande'

This was "Thaïlande" and I don't know how it was encoded, but I need to bring it back to "Thaïlande" and then URL-encode it.

Is there a way, in Python 2, to guess whether a string has already been encoded?

Solution

You have what is called Mojibake: UTF-8 bytes that were decoded with the wrong codec, typically Latin-1 or CP1252. You could use statistical analysis to see whether there is an unusual number of Latin-1 characters in there, in combinations typical of UTF-8 bytes, or whether there are any CP1252-specific characters in there.
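For the common case where UTF-8 bytes were misread as Latin-1, the round trip can be checked with the standard library alone. The following is a minimal sketch, not a robust detector: `undo_mojibake` is a hypothetical helper name, the sample string assumes a single layer of Latin-1 misdecoding, and short legitimate strings can in principle be misidentified. It is written to run on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def undo_mojibake(text, wrong_codec='latin-1', max_rounds=3):
    """Hypothetical helper: reverse UTF-8 bytes misdecoded as wrong_codec.

    Re-encodes the text with the codec assumed to have been used by
    mistake, then tries to decode the bytes as UTF-8. If either step
    fails, the text is taken to be fine and returned as it stands.
    """
    for _ in range(max_rounds):
        try:
            repaired = text.encode(wrong_codec).decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not (or no longer) mojibake of this kind
        if repaired == text:
            break  # pure ASCII round-trips unchanged
        text = repaired
    return text


# Assumed sample: u'Thaïlande' encoded as UTF-8, then decoded as Latin-1.
broken = u'Tha\xc3\xaflande'
fixed = undo_mojibake(broken)

# URL-encode the UTF-8 bytes of the repaired text.
encoded = quote(fixed.encode('utf-8'))
print(encoded)  # Tha%C3%AFlande
```

The loop repeats the round trip a few times because text is sometimes mangled more than once. Passing `wrong_codec='cp1252'` covers the Windows variant, although CP1252 leaves a handful of byte values undefined, so that path can fail on some inputs.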

There is already a package that does this for you and repairs the damage if Mojibake is detected, ftfy. From its documentation:

The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code.

and

The ftfy.fix_encoding() function will look for evidence of mojibake and, when possible, it will undo the process that produced it to get back the text that was supposed to be there.

Does this sound impossible? It’s really not. UTF-8 is a well-designed encoding that makes it obvious when it’s being misused, and a string of mojibake usually contains all the information we need to recover the original string.
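In practice that comes down to a single call. A sketch, assuming ftfy is installed (`pip install ftfy`); recent ftfy releases support Python 3 only, so on Python 2.7 an older release line would be needed. The `repair` wrapper and its pass-through behaviour when ftfy is missing are illustrative and not part of ftfy itself.

```python
try:
    import ftfy  # third-party package
except ImportError:
    ftfy = None  # keeps this sketch importable without ftfy


def repair(text):
    """Hypothetical wrapper: let ftfy undo mojibake when it is available."""
    if ftfy is None:
        return text  # nothing to repair with; return the input unchanged
    return ftfy.fix_encoding(text)


# Assumed sample: u'Thaïlande' encoded as UTF-8, then decoded as Latin-1.
print(repr(repair(u'Tha\xc3\xaflande')))
```

The repaired text can then be encoded to UTF-8 and URL-encoded as usual.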