Search code examples
mysqlutf-8utf-16

Converting escaped UTF-16 (JSON entities) back to normal UTF-8 inside MySQL


How do I convert data like that

\u0441\u043e\u0432\u0440\u0435\u043c\u0435\u043d

back to normal UTF8 inside MySQL? Is it even possible?


Solution

  • You can convert between charsets in mysql by reading the data as binary and convert it to a target charset, like this (if you have a table called example and a column called data:

    UPDATE `example` SET data=CONVERT(CONVERT(`data` USING binary) USING utf8);
    

    The JSON \uXXXX entities are in hexadecimal UTF-16 encoding, so if you have a mysql version with utf-16 support, you can convert them to utf-8. The function below shows you how to do it. First, you unhex() those values and then convert from UTF-16 to UTF-8:

    DELIMITER @@
    CREATE FUNCTION Unjson (instring TEXT CHARACTER SET utf8)
    RETURNS TEXT CHARACTER SET utf8
    BEGIN
    
      DECLARE i INT DEFAULT 0;
      DECLARE c VARCHAR(1);
      DECLARE utfstr TEXT CHARACTER SET utf16 DEFAULT "";
      DECLARE outstring TEXT CHARACTER SET utf8 DEFAULT "";
    
      WHILE i < CHAR_LENGTH(instring) DO
        SET i = i + 1;
        SET c = SUBSTRING(instring, i, 1);
        IF c = "\\" THEN
          SET c = SUBSTRING(instring, i + 1, 1);
          IF c = "u" THEN
            SET utfstr = CONCAT(utfstr, UNHEX(SUBSTRING(instring, i + 2, 4)));
            SET i = i + 5;
          END IF;
        ELSE
          IF utfstr != "" THEN
            SET outstring = CONCAT(outstring, CONVERT(utfstr USING utf8));
            SET utfstr = "";
          END IF;
          SET outstring = CONCAT(outstring, c);
        END IF;
      END WHILE;
      IF utfstr != "" THEN
        SET outstring = CONCAT(outstring, CONVERT(utfstr USING utf8));
      END IF;
    
      RETURN outstring;
    END@@
    DELIMITER ;
    

    With this MySQL function in place, you can convert your tables with:

    UPDATE `table_name` SET `column_name`=Unjson(`column_name`);
    

    I'm using MySQL 5.5, but I don't think 5.0 has utf-16 support, so you might have to check your MySQL version...

    (And yeah, I recommend you to make a backup before running this in your production environment...) ;)