Search code examples
mysqllaravelutf8mb4character-set

MySQL: 1366 Incorrect string value: '\xE0' in utf8mb4 column


I'm getting errors when certain characters are being added to a table... even when the column is has utf8mb4 character set. For example:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xE0' for column 'surname'

The data in question is: SEGAL‡ (note the double dagger)

Is this character beyond even 4 byte UTF8 or is the collation causing the issue? Or is it something else?

Screenshot showing character set and collation of the column:

screenshot showing utf8mb4 character set

It's a Laravel 8 app and the MySQL connection is configured to the following:

'charset' => 'utf8mb4',
'collation' => 'utf8mb4_unicode_ci',

Looking at the CSV file in PHPStorm, non-ASCII characters are displayed as . I've tried explicitly setting the file encoding to UTF-8 in PHPStorm (with and without BOM).

If I open the CSV in Excel then the non-ASCII characters display correctly. Confused.

Update

Examining the CSV in a HEX editor shows that a character like ä is stored as a single byte (8A). When this CSV is opened in Excel it correctly shows ä, but in everything else it shows .

I don't know what character encoding Excel is using, as this character should be typically stored as E4 when using a single byte, or C3 A4 in UTF-8.


Solution

  • Double dagger is hex E2 80 A1 (only 3 bytes) when encoded in CHARACTER SET utf8 or utf8mb4. It is also available in latin1 as hex 87.

    C3A4 [ä]LATIN SMALL LETTER A WITH DIAERESIS (in utf8/utf8mb4) or E4 in latin1.

    Please provide SHOW VARIABLES LIKE 'char%';

    See "black diamond" in Trouble with UTF-8 characters; what I see is not what I stored for the likely cause.

    The CHARACTER SET is the "encoding" and is relevant; the COLLATION is how characters compare (eg case-folding) and is not relevant.

    Nothing in the Question hints of a BOM (http://mysql.rjweb.org/doc.php/charcoll#bom_byte_order_mark) EF BB BF at the beginning of some UTF-8-encoded files.

    Do not use ucs-2 (UTF-16) unless you happen to have a file encoded that way. Still, I would declare the tables as utf8, not ucs-2.

    More

    I plugged 8AE0E8 into a handy script (that does CONVERT(CONVERT(UNHEX('8ae0e8') USING %s) USING utf8mb4) and got:

    ascii, big5, binary, cp932, eucjpms, euckr,
        gb18030, gb2312, gbk, sjis, ujis,
        utf8mb3, utf8mb4   ---invalid in these charsets
                                         swe7   3  3 '???'
                                       cp1257   5  3 '?ąč'
                                       cp1256   5  3 '?àè'
                                      geostd8   5  3 '?ჭ?'
                                       latin7   6  3 'ąč'
                                 dec8, latin5   6  3 'àè'
                                          hp8   6  3 'ÁÒ'
                                        macce   6  3 'äŗŤ'
                                        cp850   6  3 'èÓÞ'
                                      keybcs2   6  3 'ĹαΦ'
                                        cp852   6  3 'ŐÓŔ'
                                       latin2   6  3 'ŕč'
                                       latin1   6  3 'Šàè'
                                       cp1250   6  3 'Šŕč'
                                        greek   6  3 'ΰθ'
                                        cp866   6  3 'Крш'
                                       cp1251   6  3 'Љаи'
                                     armscii8   6  3 'ՈՌ'
                                       hebrew   6  3 'אט'
                                 koi8r, koi8u   7  3 '┼ЮХ'
                                     macroman   7  3 'ä‡Ë'
                                       tis620   8  3 'เ่'
    

    It looks like macroman is the charset. I think that is specific to Apple.