Search code examples
phputf-8character-encodingmariadbutf8mb4

MariaDB query won't run via PHP, works manually with HeidiSQL; en dash encoding issue; proper DB utf8mb4 character set and collation


I'm importing emails from a Gmail account and storing them in a database. Apparently my PHP logic to convert everything to UTF-8 has failed with one particular email and when I run the query with that email I receive the following MariaDB error:

Incorrect string value: '\x96 conv...'

Here is the minimized version of the SQL query:

INSERT INTO mail (subject) VALUES ('orkut – convite...');

I eventually tracked this down to a header subject string. In my general MariaDB query log the string appears as intended with the en dash (): orkut – conv.... When PHP attempts to run the query I get the error Incorrect string value as mentioned above however when I manually run the exact same query via HeidiSQL the query executes just fine!

The header is split between the key and value and then the value is processed as follows:

$p = explode(': ', $header, 2);
$s = mb_convert_encoding(trim($p[1]), 'UTF-8', mb_detect_encoding($p[1]))

Now I've tried many things and the closest I could get was cmbuckly's answer on another thread however using $s = iconv('utf-8', 'windows-1252', $s); did not resolve the problem in this circumstance. I have heaps of other attempts like using utf8_encode($s) afterwards though I'm just at a loss at this point.

  • My MariaDB database has the proper utf8mb4 character set and utf8mb4_unicode_520_ci database collation.
  • I am not going to try to use MariaDB to encode/re-encode text, this is explicitly intended as a PHP question.

What is causing MariaDB to not work with the query via PHP though allows the query to work via HeidiSQL? Have I properly decoded the string?


Solution

  • https://www.fileformat.info/info/unicode/char/2013/charset_support.htm

    The only time an en-dash has a 0x96 byte in it is when a windows cp125x encoding is used.

    Figure out which one your source data uses and convert it to UTF8 with

    $utf8_data = mb_convert_encoding($data, 'UTF-8', 'cp125x');
    

    Additionally: UTF-8 all the way through