Search code examples
mysqlencodingcharacter-encodingmojibake

recreate encoding mixup in mysql table


I have an sql table where a column has utf8_unicode_ci encoding, but the table itself has latin1_swedish_ci encoding (as reported under Row Statistics in Structure tab of phpMyAdmin).

The PHP webapp that accesses the database displays Japanese text correctly, but inside phpMyAdmin everything is mojibake. The webapp (correctly) displays the Japanese text Xで有名な, but in phpMyAdmin it is Xã¦ã‚™æœ‰å㪠(hex() output is 312E2058C3A3C281C2A6C3A3E2809AE284A2C3A6C593E280B0C3A5C290C28DC3A3C281C2AA).

The app that was used to generate the data in the table is now broken, but I need to add a few new records. How can I recreate the mojibake found in the table?

I tried to reproduce the mojibake with python:

def rev_engineer(utf8):
    mojibake = utf8.encode('utf8').decode('latin1')
    print(mojibake)

rev_engineer('Xで有名な')
# output:    Xã¦ãæåãª
# should be: Xã¦ã‚™æœ‰åãª

This is obviously very similar, but not quite there. I then tried looping through every possible encoding listed in python's documentation, and encoding/decoding every possible combination, but that did not come up with a match, either. Any idea what I'm missing?


Solution

  • To be sure my character will be interpreted as UTF8 sequence

    test> set names utf8 ;
    Query OK, 0 rows affected (0.00 sec)
    

    check that i have 2 bytes for é

    test> select hex(binary('é')) ;
    +-------------------+
    | hex(binary('é')) |
    +-------------------+
    | C3A9              |
    +-------------------+
    1 row in set (0.00 sec)
    

    checking i have the same value

    test ]> select convert(binary(convert(convert(unhex('312E2058C3A3C281C2A6C3A3E2809AE284A2C3A6C593E280B0C3A5C290C28DC3A3C281C2AA') using utf8 ) using latin1 )) using utf8 );
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | convert(binary(convert(convert(unhex('312E2058C3A3C281C2A6C3A3E2809AE284A2C3A6C593E280B0C3A5C290C28DC3A3C281C2AA') using utf8 ) using latin1 )) using utf8 ) |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | 1. Xで有名な                                                                                                                                          |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    

    by copy/paste the output string , i can reverse the process

    test > select hex(convert (convert(binary('1. Xで有名な  ') using latin1 ) using utf8 )) ;
    +---------------------------------------------------------------------------------+
    | hex(convert (convert(binary('1. Xで有名な') using latin1 ) using utf8 )) |
    +---------------------------------------------------------------------------------+
    | 312E2058C3A3C281C2A6C3A3E2809AE284A2C3A6C593E280B0C3A5C290C28DC3A3C281C2AA      |
    +---------------------------------------------------------------------------------+
    1 row in set (0.00 sec)
    

    If you have a few rows to insert you insert your rows with pphpmyadmin and if it does not work directly via the command mysql .

    If you want to use python you can use this module : https://pypi.org/project/mysql-latin1-codec/