I recently stumbled upon a MySQL database that was encoded using Latin1 and was rendering when viewed on a browser question mark symbols. To fix this we changed the encoding of the DB to utf8 and the Collation to utf8_general_ci on all of our tables, but the data already stored was still showing up with question mark symbols, all of the storing and polling of data from mysql to the browser was done by php i made sure utf8 was used on php as well and even ran set names utf8 as many people suggested online, the problem is that now I ended up with weird characters such as Ñ on strings we knew didn't had them.
Examples of data
Stored:
EMMANUEL PE\xc3\u0192\xc2\u2018A GOMEZ PORTUGAL
Rendered:
EMMANUEL PEÑA GOMEZ PORTUGAL
Proper:
EMMANUEL PEÑA GOMEZ PORTUGAL
Stored:
Luis Hern\xe1ndez-Higareda
Rendered:
Luis Hernández-Higareda
Proper:
Luis Hernández-Higareda
Stored:
Teresa de Jes\xc3\u0192\xc2\xbas Galicia G\xc3\u0192\xc2\xb3mez
Rendered:
Teresa de Jesús Galicia Gómez
Proper:
Teresa de Jesús Galicia Gómez
Stored:
DR. JOS\xc3\u0192\xc2\u2030 ABEN\xc3\u0192\xc2\x81MAR RIC\xc3\u0192\xc2\x81RDEZ GARC\xc3\u0192\xc2\x8dA
Proper:
DR. JOSÉ ABENÃÂMAR RICÃÂRDEZ GARCÃÂA
Currently I'm using python to get the data from the DB, I'm trying to normalize to unicode utf8 but I'm really lost, thats as far as I'm getting here, I need to convert what currently shows up as weird characters to readable text as shown above.
what am I missing here? is the data on unrepairable?
Functions https://gist.github.com/2649463
Note: of all of the examples there's 1 that is properly rendering (left there so consideration is taken if any advice is given on how to fix this )
If you attempted to insert characters which cannot be represented in latin1
into columns stored under that character encoding, those characters would have been irreversibly replaced with ?
- information has been lost: your only option is to reinsert/update the data now that the column is stored in utf8
.
However, some of the data in your question doesn't make a great deal of sense. For example:
Stored:
EMMANUEL PE\xc3\u0192\xc2\u2018A GOMEZ PORTUGAL
Are you trying to show the bytes or the characters that are currently stored? Either way, one of the \u
or \x
escape codes is meaningless.
You say that the original data was encoded as latin1
; in that character set, the Ñ character is encoded as 0xd1
. You say that you then converted the data to utf8
, which would have changed the encoding of that character to the two-byte sequence 0xc391
(which explains the \xc3
you show as being stored, above; however, it is not clear how the second byte 0x91
turned into the sequence \u0192\xc2\u2018
in your snippet*).
I suspect that the data has in fact been through some further conversions, perhaps between that which is currently stored and whatever means you are using to view such storage. You would be well advised to first ascertain exactly what is stored within your database:
SELECT HEX(my_column) FROM my_table WHERE ...
Once that has been determined, you will be better able to understand what conversions (if any) need to be employed on your stored data to make it utf8
and what undesirable conversions (if any) are taking place during storage and retrieval operations.
* Having now read Thanasis Petsas's answer, I realise he's figured out that you appear to have decoded properly utf8
-encoded strings as latin1
, encoded the resulting characters using utf8
and then decoded those bytes as latin1
again. This does indeed produce the character sequences you've shown, but it is still nevertheless necessary to understand what is actually stored and what is due to conversions during retrieval.