Search code examples
mysqlutf-8special-charactersmojibake

How to clean data with special characters in MySQL


How can one clean data that looks like this Réation, l’Oreal to look like this R'action and L'Oreal respectively in MySQL?


Solution

  • That looks like an example of "double encoding". It is where the right hand was talking utf8, but the left hand was listening for latin1. See Trouble with UTF-8 characters; what I see is not what I stored and See also http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases .

    Réation -> Réation after undoing the double-encoding.

    Yet you say R'action -- I wonder if you were typing é as e' or 'e?? I'm also going to assume you meant L’Oreal?? (Note the 'right single quote mark' instead of 'apostrophe'.)

    First, we need to verify that it is actually an ordinary double-encoding.

    SELECT col, HEX(col) FROM ... WHERE ...
    

    should give you this for the hex for Réation:

    52  E9       6174696F6E  -- latin1 encoding
    52 C3A9      6174696F6E  -- utf8 encoding
    52 C383 C2A9 6174696F6E  -- double encoding
    

    (Ignore the spacing.)

    If you got the third of those proceed with my Answer. If you get anything else, STOP! -- the problem is more complex than I thought.

    Now, see if the double-encoding fix will fix it (before fixing it):

    SELECT col, CONVERT(BINARY(CONVERT(CONVERT(
                      BINARY(CONVERT(col USING latin1)) USING utf8mb4)
                          USING latin1)) USING utf8mb4)
         FROM tbl;
    

    You need to prevent it from happening and fix the data. Some of the following is irreversible; test it on a copy of the table!

    Your case is: CHARACTER SET latin1, but have utf8/utf8mb4 bytes in it; leave bytes alone while fixing charset:

    First, let's assume you have this declaration for tbl.col:

    col VARCHAR(111) CHARACTER SET latin1 NOT NULL
    

    Then to convert the column without changing the bytes:

    ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
    ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
    

    Note: If you start with TEXT, use BLOB as the intermediate definition. (Be sure to keep the other specifications the same - VARCHAR, NOT NULL, etc.)

    Do that for each column in each table with the problem.

    (In this discussion I don't distinguish between utf8mb4 and utf8. Most text is quite happy with either; Emoji and some Chinese need utf8mb4, not just utf8.)

    from Comment

            CONVERT(UNHEX('C38EC2B2') USING utf8mb4) = 'β'  (Greek beta)
    CONVERT(CONVERT(UNHEX('C38EC2B2') USING latin1) USING utf8mb4) = 'β'
    

    My conclusion: First you had some misconfiguration. Then you applied one or more wrong fixes. You now have such a mess that I dare not try to help you unravel it. That is, the mess is on beyond simply "double encoding".

    If possible, start over, being sure that some test data gets stored correctly before adding more data. If the data is bad not try to fix the data; back off and start over again. See the "best bractice" in "Trouble..." for getting set up correctly. I'll be around to help you interpret whether the hex you see in the tables is correct.