Search code examples
mysqlunicodeutf-8character-encodingmojibake

MySQL - Converting ANSI to Unicode


Somehow, data in MySQL database has been converted from Unicode to ANSI and caused a lot of the symbols that are in the system to display incorrectly.

Symbols now display like:

  • –
  • •
  • ₸
  • Ω
  • etc.

I would like to convert it back to Unicode...
I have tried:

  • Re-importing the database as UTF8 character set
  • Using Notepad++ to "convert to UTF8..."
  • Added default charset UTF8 to Apache config...
  • A PHP script that takes all the DBs and tables, copies them, converts the data, then re-creates the original databases.

These methods did not work out... they seem to just leave my data the way it is, but any future attempts to use symbols works fine.
I want to convert these existing misinterpretations back to their original forms!


Hex output of an example text from the DB.

SELECT hex(name) FROM table_name where id = 17;

which is equivalent to: SELECT hex('☼STICKY☼');

OUTPUT: C3A2CB9CC2BC535449434B59C3A2CB9CC2BC


Solution

  • The output you showed looked somewhat like "double encoding" and discussed in http://mysql.rjweb.org/doc.php/charcoll . Please dump a little of the text in hex for confirmation.

    To clarify, you are probably looking at utf8 (not unicode) versus latin1 (not ANSI).

    @Tomas M - In PHP, mysqli_set_charset('utf8') is the proper call, not SET NAMES utf8.

    However, if the data is mangled in the table(s), that call will not help.

    (Edit -- added after HEX by OP)

    mysql> SELECT hex(convert(convert(unhex('C3A2CB9CC2BC') using utf8) using latin1));
    +----------------------------------------------------------------------+
    | hex(convert(convert(unhex('C3A2CB9CC2BC') using utf8) using latin1)) |
    +----------------------------------------------------------------------+
    | E298BC                                                               |
    +----------------------------------------------------------------------+
    mysql> SELECT unhex('E298BC');
    +-----------------+
    | unhex('E298BC') |
    +-----------------+
    | ☼               |
    +-----------------+
    

    Is '☼' what you were expecting before 'STICKY'? You had "double encoding"; it took 2 steps to decode it.