Search code examples
pythonmysqlunicodeutf-8latin1

String encoding and decoding from possibly latin1 and utf8


I recently stumbled upon a MySQL database that was encoded using Latin1 and was rendering when viewed on a browser question mark symbols. To fix this we changed the encoding of the DB to utf8 and the Collation to utf8_general_ci on all of our tables, but the data already stored was still showing up with question mark symbols, all of the storing and polling of data from mysql to the browser was done by php i made sure utf8 was used on php as well and even ran set names utf8 as many people suggested online, the problem is that now I ended up with weird characters such as Ñ on strings we knew didn't had them.

Examples of data

Stored:

EMMANUEL PE\xc3\u0192\xc2\u2018A GOMEZ PORTUGAL

Rendered:

EMMANUEL PEÑA GOMEZ PORTUGAL

Proper:

EMMANUEL PEÑA GOMEZ PORTUGAL


Stored:

Luis Hern\xe1ndez-Higareda

Rendered:

Luis Hernández-Higareda

Proper:

Luis Hernández-Higareda


Stored:

Teresa de Jes\xc3\u0192\xc2\xbas Galicia G\xc3\u0192\xc2\xb3mez

Rendered:

Teresa de Jesús Galicia Gómez

Proper:

Teresa de Jesús Galicia Gómez


Stored:

DR. JOS\xc3\u0192\xc2\u2030 ABEN\xc3\u0192\xc2\x81MAR RIC\xc3\u0192\xc2\x81RDEZ GARC\xc3\u0192\xc2\x8dA

Proper:

DR. JOSÉ ABENÃÂMAR RICÃÂRDEZ GARCÃÂA

Currently I'm using python to get the data from the DB, I'm trying to normalize to unicode utf8 but I'm really lost, thats as far as I'm getting here, I need to convert what currently shows up as weird characters to readable text as shown above.

what am I missing here? is the data on unrepairable?

Functions https://gist.github.com/2649463

Note: of all of the examples there's 1 that is properly rendering (left there so consideration is taken if any advice is given on how to fix this )


Solution

  • If you attempted to insert characters which cannot be represented in latin1 into columns stored under that character encoding, those characters would have been irreversibly replaced with ? - information has been lost: your only option is to reinsert/update the data now that the column is stored in utf8.

    However, some of the data in your question doesn't make a great deal of sense. For example:

    Stored:

    EMMANUEL PE\xc3\u0192\xc2\u2018A GOMEZ PORTUGAL

    Are you trying to show the bytes or the characters that are currently stored? Either way, one of the \u or \x escape codes is meaningless.

    You say that the original data was encoded as latin1; in that character set, the Ñ character is encoded as 0xd1. You say that you then converted the data to utf8, which would have changed the encoding of that character to the two-byte sequence 0xc391 (which explains the \xc3 you show as being stored, above; however, it is not clear how the second byte 0x91 turned into the sequence \u0192\xc2\u2018 in your snippet*).

    I suspect that the data has in fact been through some further conversions, perhaps between that which is currently stored and whatever means you are using to view such storage. You would be well advised to first ascertain exactly what is stored within your database:

    SELECT HEX(my_column) FROM my_table WHERE ...
    

    Once that has been determined, you will be better able to understand what conversions (if any) need to be employed on your stored data to make it utf8 and what undesirable conversions (if any) are taking place during storage and retrieval operations.


    * Having now read Thanasis Petsas's answer, I realise he's figured out that you appear to have decoded properly utf8-encoded strings as latin1, encoded the resulting characters using utf8 and then decoded those bytes as latin1 again. This does indeed produce the character sequences you've shown, but it is still nevertheless necessary to understand what is actually stored and what is due to conversions during retrieval.