Search code examples
mysqlutf-8utf8mb4

mysql utf8mb4_unicode_ci cause unique key collision


I have a table like this

CREATE TABLE `mb1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(30) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `u_name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And I insert tow rows

insert into mb1(name) values('K'),('K');

Note, the second K is unicode character

+------+-----------+
| name | hex(name) |
+------+-----------+
| K    | 4B        |
| K   | EFBCAB    |
+------+-----------+

Why do they cause unique key collision? Aren't they different character in utf8mb4?

After remove COLLATE utf8mb4_unicode_ci, the problem disappeared.


Solution

  • Why do they cause unique key collision? Aren't they different character in utf8mb4?

    You are missing the point about CHARACTER SET and COLLATION. A CHARACTER SET is a collection of different characters. A COLLATION says whether to treat the characters as equal -- think A and a -- different characters, but treated for ORDER BY and WHERE =, etc as being the same.

    mysql> SELECT 'K'='K' COLLATE utf8_unicode_ci;
    +-----------------------------------+
    | 'K'='K' COLLATE utf8_unicode_ci  |
    +-----------------------------------+
    |                                 1 |
    +-----------------------------------+
    

    So in utf8_unicode_ci (or utf8mb4_unicode_ci), those two characters are considered to be "equal".

    "Equal" is the test for UNIQUE keys.

    Set the COLLATION for the column to whatever makes sense for you.

    • utf8mb4_unicode_ci for good 'real life' comparisons, apparently including this one. K=k=Ķ=ķ
    • utf8mb4_unicode_ci for more simple-minded comparisons. In particular no 2-character combinations match 1-character encodings. Case folding and accent stripping does occur. K=k=Ķ=ķ
    • utf8mb4_bin blindly checks the bits. No case folding, etc. K k Ķ ķ are all unequal.

    utf8mb4_latvian_ci is a little different: K=k but not equal to Ķ=ķ . There are other specialized collations for other languages (mostly Western European).

    Your is called "FULLWIDTH LATIN CAPITAL LETTER K", so it is quite reasonable that it compare equal to Latin K.