I'm importing emails from a Gmail account and storing them in a database. Apparently my PHP logic to convert everything to UTF-8 has failed with one particular email and when I run the query with that email I receive the following MariaDB error:
Incorrect string value: '\x96 conv...'
Here is the minimized version of the SQL query:
INSERT INTO mail (subject) VALUES ('orkut – convite...');
I eventually tracked this down to a header subject string. In my general MariaDB query log the string appears as intended with the en dash (–
): orkut – conv...
. When PHP attempts to run the query I get the error Incorrect string value
as mentioned above however when I manually run the exact same query via HeidiSQL the query executes just fine!
The header is split between the key and value and then the value is processed as follows:
$p = explode(': ', $header, 2);
$s = mb_convert_encoding(trim($p[1]), 'UTF-8', mb_detect_encoding($p[1]))
Now I've tried many things and the closest I could get was cmbuckly's answer on another thread however using $s = iconv('utf-8', 'windows-1252', $s);
did not resolve the problem in this circumstance. I have heaps of other attempts like using utf8_encode($s)
afterwards though I'm just at a loss at this point.
utf8mb4
character set and utf8mb4_unicode_520_ci
database collation.What is causing MariaDB to not work with the query via PHP though allows the query to work via HeidiSQL? Have I properly decoded the string?
https://www.fileformat.info/info/unicode/char/2013/charset_support.htm
The only time an en-dash has a 0x96
byte in it is when a windows cp125x encoding is used.
Figure out which one your source data uses and convert it to UTF8 with
$utf8_data = mb_convert_encoding($data, 'UTF-8', 'cp125x');
Additionally: UTF-8 all the way through