Note: In the following question you may see ? or blocks instead of characters, this is because you don't have the appropriate font. Please ignore this.
I have a table with data structured as follows:
CREATE TABLE `decomposition_dup` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`parent` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
`structure` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
`child` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `parent` (`parent`),
KEY `child` (`child`),
KEY `parent_2` (`parent`,`child`)
) ENGINE=InnoDB AUTO_INCREMENT=211929 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
And some example data:
INSERT INTO `decomposition_dup` (`id`, `parent`, `structure`, `child`) VALUES
(154647, '锦', 'a', '钅'),
(154648, '锦', 'a', '帛'),
(185775, '钅', 'd', '二'),
(185774, '钅', 'd', '㇟'),
(21195, '钅', 'd', '𠂉'),
(21178, '⻐', 'd', '乇'),
(21177, '⻐', 'd', '𠂉');
And the charsets are all set correctly:
It is very important to note that:
As you can see, they are different characters. However, in some cases they are treated as the same character.
When I run the following query, it only returns the correct child (i.e. doesn't return the similar-looking but different character child):
SELECT *
FROM decomposition_dup
WHERE parent = '锦'
This is correct behaviour.
However, when I run the following query using 钅 (http://unicode.scarfboy.com/?s=%E9%92%85) it returns both the similar characters:
SELECT *
FROM decomposition_dup
WHERE parent = '钅'
This should only return 185775, 185774 & 21195.
And when I run the following query using ⻐ (http://unicode.scarfboy.com/?s=%E2%BB%90) it also returns both the similar characters:
SELECT *
FROM decomposition_dup
WHERE parent = '⻐'
This should only return 21178 and 21177.
If I replace =
with LIKE
for the broken queries (i.e. Case 2 and Case 3), they return correctly.
For example, the following query is the same as Case 3 but usingLIKE
:
SELECT *
FROM decomposition_dup
WHERE parent LIKE '⻐'
This returns the correct characters but slows down the query.
Is this a bug in MySQL or is there something that I am overlooking when querying for UTF8 extended characters?
If you want them to be the same, set the COLLATION
of the columns to utf8mb4_unicode_ci
or utf8mb4_unicode_520_ci
.
If you want them to be different, use utf8mb4_general_ci
, instead:
mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
-> CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_general_ci AS general;
+---------+
| general |
+---------+
| 0 |
+---------+
mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
-> CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_ci AS unicode;
+---------+
| unicode |
+---------+
| 1 |
+---------+
mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
-> CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_520_ci AS unicode_520;
+-------------+
| unicode_520 |
+-------------+
| 1 |
+-------------+