Search code examples
mysqlunicodesql-likecjkutf8mb4

MySQL returns incorrect UTF8 extended characters in some cases only


Note: In the following question you may see ? or blocks instead of characters, this is because you don't have the appropriate font. Please ignore this.

Background

I have a table with data structured as follows:

CREATE TABLE `decomposition_dup` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `parent` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 `structure` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 `child` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 PRIMARY KEY (`id`),
 KEY `parent` (`parent`),
 KEY `child` (`child`),
 KEY `parent_2` (`parent`,`child`)
) ENGINE=InnoDB AUTO_INCREMENT=211929 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

And some example data:

INSERT INTO `decomposition_dup` (`id`, `parent`, `structure`, `child`) VALUES
(154647, '锦', 'a', '钅'),
(154648, '锦', 'a', '帛'),
(185775, '钅', 'd', '二'),
(185774, '钅', 'd', '㇟'),
(21195, '钅', 'd', '𠂉'),
(21178, '⻐', 'd', '乇'),
(21177, '⻐', 'd', '𠂉');

And the charsets are all set correctly:

enter image description here

Problem

It is very important to note that:

As you can see, they are different characters. However, in some cases they are treated as the same character.

Case 1

When I run the following query, it only returns the correct child (i.e. doesn't return the similar-looking but different character child):

SELECT *
FROM decomposition_dup
WHERE parent = '锦'

enter image description here

This is correct behaviour.

Case 2

However, when I run the following query using 钅 (http://unicode.scarfboy.com/?s=%E9%92%85) it returns both the similar characters:

SELECT *
FROM decomposition_dup
WHERE parent = '钅'

enter image description here

This should only return 185775, 185774 & 21195.

Case 3

And when I run the following query using ⻐ (http://unicode.scarfboy.com/?s=%E2%BB%90) it also returns both the similar characters:

SELECT *
FROM decomposition_dup
WHERE parent = '⻐'

enter image description here

This should only return 21178 and 21177.

Case 4

If I replace = with LIKE for the broken queries (i.e. Case 2 and Case 3), they return correctly.

For example, the following query is the same as Case 3 but usingLIKE:

SELECT *
FROM decomposition_dup
WHERE parent LIKE '⻐'

enter image description here

This returns the correct characters but slows down the query.

Question

Is this a bug in MySQL or is there something that I am overlooking when querying for UTF8 extended characters?


Solution

  • If you want them to be the same, set the COLLATION of the columns to utf8mb4_unicode_ci or utf8mb4_unicode_520_ci.
    If you want them to be different, use utf8mb4_general_ci, instead:

    mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
        ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_general_ci AS general;
    +---------+
    | general |
    +---------+
    |       0 |
    +---------+
    
    mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
        ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_ci AS unicode;
    +---------+
    | unicode |
    +---------+
    |       1 |
    +---------+
    
    mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
        ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_520_ci AS unicode_520;
    +-------------+
    | unicode_520 |
    +-------------+
    |           1 |
    +-------------+