Tags: php, mysql, character-encoding, entity-framework-6, utf8mb4

UTF8 data stored from a Latin1 Connection, how to check if my data is OK?


Latin1 Connection, instead of UTF8

I've recently had reports from users of my website that they are unable to create content in Chinese or Arabic. When I tried creating content on my site using Chinese characters, I found that each of these characters is stored as a question mark (?).

I realised from other questions and articles that I've read that I'm probably in 'character set hell'.

It seems that Entity Framework has been connecting to the database over a Latin1 connection, as this is the default for MySQL, while the columns in my database are UTF8.

How is my data encoded?

From what I have read of the process, I think that the data in my database is UTF8 data from the browser that was treated as Latin1 during transit to the database, and then encoded again as UTF8 just before it was stored.

I have been using a PHP script to determine whether the data comes out correctly when I set my connection to use UTF8 - and when I select data from my database, it doesn't look like there is any difference.

  <!DOCTYPE html>
  <html>
  <head>
   <meta charset="utf-8" />
  </head>
  <body>
    <?php 
    
    //
    // Make the connection to the database
    //
    $link = mysqli_connect('localhost','root', '', 'mydatabase');

    if (!$link) { 
        die('Could not connect to MySQL: ' . mysqli_connect_error()); 
    }

    // Set connection character set to UTF8
    $link->set_charset('utf8');
    
    echo '<p>Connection OK</p>';
    
    //
    // Request the string from the database
    //
    $result = $link->query("SELECT questiontext FROM question WHERE id = 101");
    
    $row = $result->fetch_assoc();
    
    // Display the data
    echo "Result: " . $row['questiontext'] . '<br/>';
    
    mysqli_close($link);
    
    ?>
  </body>
  </html>
  • If I request a string of plain [A-Z 0-9] characters from the existing data over a UTF8 connection using this PHP script, it displays on the screen just as it did over the Latin1 connection with Entity Framework. I can't tell the difference, and there doesn't appear to be any issue with the data.
  • If I request a string of Chinese characters that appear to have been entered into the database as question marks, it displays as question marks when retrieved over a UTF8 connection.

Because I was previously using a Latin1 connection, I expected the data to be displayed as garbage when I connected to the database using a UTF8 connection, but it isn't.

I used Entity Framework to interrogate the MySQL variables that were being used for the connection before and after adding CharSet=utf8; to my connection string. Hopefully this gives you an idea of how the connection was being established before and how it is now:

Connection before:

[Image: Connection before]

Connection with connection string charset updated:

[Image: Connection after connection string update]

How can I determine whether the data in the database is encoded incorrectly (for example, Latin1 data encoded as UTF8), so that I can decide whether I can just change my connection string to use UTF8 and everything will work OK?

Update

I've been experimenting by switching the connection type between UTF8 and Latin1 and these are my findings...

If I set my connection type to latin1 and output the characters, I end up getting something like this:

Tu es dans une �le d�serte

HEX (bin2hex): 54752065732064616e7320756e6520 ee 6c652064 e9 7365727465203a

If I set my connection to utf8:

Tu es dans une île déserte

HEX (bin2hex): 54752065732064616e7320756e6520 c3ae 6c652064 c3a9 7365727465203a

(bold and spacing added by Rick James)

When using a UTF8 connection, there aren't any dodgy-looking characters at all; they only appear when I set my connection type to latin1. This leads me to believe that the encoding of my data is OK and that it is presumably just straightforward UTF8.
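The hex dumps above can be compared against what each encoding would produce for the same text. The following is a minimal PHP sketch, assuming the mbstring extension is available; the variable names are illustrative only. It prints the bytes of "île déserte" as Latin1, as plain UTF8, and as double-encoded UTF8 (UTF8 bytes misread as Latin1 and converted to UTF8 again).

```php
<?php
// "île déserte", written with \u{...} escapes so the result does not
// depend on how this source file itself is saved.
$utf8 = "\u{00EE}le d\u{00E9}serte";

// The same text as single-byte Latin1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Double encoding: the UTF8 bytes are wrongly treated as Latin1
// and converted to UTF8 a second time.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n"; // ee6c652064e97365727465
echo bin2hex($utf8), "\n";   // c3ae6c652064c3a97365727465
echo bin2hex($double), "\n"; // c383c2ae6c652064c383c2a97365727465
```

The bytes seen over the UTF8 connection (c3ae, c3a9) match the plain UTF8 line, not the double-encoded one (c383c2ae, c383c2a9), which is consistent with the data being stored correctly.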

I can only conclude from this that Entity Framework has been communicating over a UTF8 connection all along, but I don't know how to confirm that the data is stored correctly.


Solution

  • For Chinese, you need to tell MySQL to use utf8mb4, not just utf8.

    When trying to use utf8/utf8mb4, if you see question marks (regular ones, not black diamonds; ? is hex 3F), check the following:

    • The bytes to be stored must be encoded as utf8. If they are not, fix this.
    • The column in the database must be CHARACTER SET utf8mb4. If it is not, fix this.
    • Also, check that the connection during reading is utf8mb4.
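To illustrate where the question marks come from, here is a small PHP sketch, assuming the mbstring extension. It only imitates the substitution that happens when a character has no equivalent in the target character set; it does not talk to MySQL, and the variable names are made up for the example.

```php
<?php
// Use "?" (hex 3F) for characters the target charset cannot represent,
// imitating what ends up stored in the question's case.
mb_substitute_character(0x3F);

$chinese = "\u{65B0}"; // the character 新, UTF8 bytes E6 96 B0

// Latin1 has no slot for this character, so it is replaced.
$asLatin1 = mb_convert_encoding($chinese, 'ISO-8859-1', 'UTF-8');

// Converting back cannot restore it: the original bytes are gone.
$backToUtf8 = mb_convert_encoding($asLatin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($chinese), "\n";  // e696b0
echo bin2hex($asLatin1), "\n"; // 3f
echo $backToUtf8, "\n";        // ?
```

Unlike Mojibake, this loss is not reversible; rows that already contain 3F in place of the original text have to be re-entered.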

    æ–°æµªæ–°é—» is Mojibake for 新浪新闻

    When trying to use utf8/utf8mb4, if you see Mojibake, check the following. This discussion also applies to Double Encoding, which is not necessarily visible.

    • The bytes to be stored need to be utf8-encoded.
    • The connection when INSERTing and SELECTing text needs to specify utf8mb4. (set_charset)
    • The column needs to be declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
    • HTML should start with <meta charset=UTF-8>. (You did this.)

    To verify, do SELECT col, HEX(col) FROM .... If the hex output for 新 is E696B0, then it is correctly encoded with utf8/utf8mb4. If you get C3A6E28093C2B0, it is "double-encoded". If the hex begins with E or F, it is probably correctly encoded. Also, the hex for a single Chinese character will be 6 or 8 digits long in all cases.
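The two hex values quoted above can be reproduced, and HEX(col) output can be roughly classified, with a short PHP sketch. This assumes the mbstring extension; classifyHex is a hypothetical helper written for this example, not part of PHP or MySQL. It is only a heuristic: it tries to undo one round of "UTF8 bytes misread as cp1252" and checks whether the result is still valid UTF8.

```php
<?php
// Hypothetical helper: classify a hex string as returned by HEX(col).
function classifyHex(string $hex): string
{
    $bytes = hex2bin($hex);
    if ($bytes === false || !mb_check_encoding($bytes, 'UTF-8')) {
        return 'not valid UTF-8';
    }
    // Try to undo one round of double encoding via cp1252.
    $undone = mb_convert_encoding($bytes, 'Windows-1252', 'UTF-8');
    // The undo only counts if converting forward again restores the input.
    $lossless = mb_convert_encoding($undone, 'UTF-8', 'Windows-1252') === $bytes;
    if ($lossless
        && preg_match('/[\x80-\xFF]/', $undone) === 1
        && mb_check_encoding($undone, 'UTF-8')) {
        return 'probably double-encoded';
    }
    return 'looks correctly encoded';
}

$correct = "\u{65B0}"; // the character 新
// Simulate double encoding: UTF8 bytes treated as cp1252, then re-encoded.
$double = mb_convert_encoding($correct, 'UTF-8', 'Windows-1252');

echo strtoupper(bin2hex($correct)), ' => ', classifyHex(bin2hex($correct)), "\n";
// E696B0 => looks correctly encoded
echo strtoupper(bin2hex($double)), ' => ', classifyHex(bin2hex($double)), "\n";
// C3A6E28093C2B0 => probably double-encoded
```

Text that legitimately contains sequences such as "Ã©" would be misreported by this check, so treat the result as a hint and confirm against a few rows whose intended content is known.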