I'm trying to convert a database to use utf8mb4 instead of utf8. Everything is going fine except one table:
CREATE TABLE `search_terms` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`search_term` varchar(128) NOT NULL,
`time_added` timestamp NULL DEFAULT NULL,
`count` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `search_term` (`search_term`),
KEY `search_term_count` (`count`)
) ENGINE=InnoDB AUTO_INCREMENT=198981 DEFAULT CHARSET=utf8;
Basically all it does is save an entry every time somebody searches something in a form so we can track the number of searches, very simple.
There's a unique index on search_term
because we want to only have one row per search term and instead increment the count value.
However when converting to utf8mb4 I am getting duplicate entry errors. Here is the command I am running:
ALTER TABLE `search_terms` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Looking in the database I can see various examples like this:
fm2012
fm2012
fm2012
In it's current utf8 character set, these are all being treated as unique and exist within the database without ever having an issue with the unique index on search_term
.
But when converting to utf8mb4 they are now being considered equal and throwing an error due to that index.
I can figure out how to merge these together easily enough, but i'm concerned this may be a symptom of a greater underlying problem. I'm not really sure how this has happened or what the consequences may be, so my questions are a bit vague:
Your problem is the change of collation: you're using general_ci
and you're converting to unicode_ci
: general_ci
is quite a simple collation that doesn't know much about unicode, but unicode_ci
does.
The first "f" in your example string is a "Fullwidth Latin Small Letter F" (U+FF46) which is considered equal to "Latin Small Letter F" (U+0066) by unicode_ci
but not by general_ci
.
Normally it's recommended to use unicode_ci
exactly because of its unicode-awareness but you could convert to utf8mb4_general_ci
to prevent this problem.
To prevent this problem in the future, you should normalize your input before saving it in the DB. Normally you'd use NFC, but your case seems to call for NFKC. This should bring all "equivalent" strings to the same form.