Search code examples
duplicatesmariadbcharactercjk

False duplicate with mariadb and chinese characters


I'm using mariadb Server version: 10.3.37-MariaDB-0ubuntu0.20.04.1 Ubuntu 20.04 on a local machine My database 'nihongo' contains a 'radicalAlt' table which is:

mariadb -e "show full columns from nihongo.radicalAlt;"

+-------------+---------------------+--------------------+------+-----+---------+----------------+---------------------------------+----------------------------------+
| Field       | Type                | Collation          | Null | Key | Default | Extra          | Privileges                      | Comment                          |
+-------------+---------------------+--------------------+------+-----+---------+----------------+---------------------------------+----------------------------------+
| id          | tinyint(3) unsigned | NULL               | NO   | PRI | NULL    | auto_increment | select,insert,update,references |                                  |
| radicalAlt  | char(1)             | utf8mb4_unicode_ci | YES  |     | NULL    |                | select,insert,update,references |                                  |
| radical_id  | int(10) unsigned    | NULL               | YES  |     | NULL    |                | select,insert,update,references | Clé externe pour `radical`.`pk`  |
| strokecount | tinyint(3) unsigned | NULL               | YES  |     | NULL    |                | select,insert,update,references |                                  |
| writing     | varchar(255)        | utf8mb4_unicode_ci | YES  |     | NULL    |                | select,insert,update,references |                                  |
| comment     | varchar(255)        | utf8mb4_unicode_ci | YES  |     | NULL    |                | select,insert,update,references |                                  |
| unicode     | char(8)             | utf8mb4_unicode_ci | YES  | UNI | NULL    |                | select,insert,update,references |                                  |
+-------------+---------------------+--------------------+------+-----+---------+----------------+---------------------------------+----------------------------------+`

The table appears to contain duplicates:

`mariadb -e  "use nihongo; select id, radicalAlt, count(*) as count from radicalAlt group by radicalAlt having count(*)>1;"

+----+------------+-------+
| id | radicalAlt | count |
+----+------------+-------+
| 16 | ⺌         |     2 |
| 24 | ⺝         |     2 |
... and so on...`

But when I select the first "duplicate", I notice there is no duplicate, but two distinct characters (⺌ and ⺍) with distinct unicode values, and that are well displayed in my linux gnome terminal. However, there a considered the same character in the query:

`mariadb  -e  "use nihongo; select * from radicalAlt where radicalAlt='⺌';"
+----+------------+------------+-------------+---------+---------+---------+
| id | radicalAlt | radical_id | strokecount | writing | comment | unicode |
+----+------------+------------+-------------+---------+---------+---------+
| 16 | ⺌         |         42 |           3 | NULL    | NULL    | 2E8C    |
| 83 | ⺍         |         42 |           3 | NULL    | NULL    | 2E8D    |
+----+------------+------------+-------------+---------+---------+---------+`

I ensured that character set and collations are consitent, database and table have the same character set and collation:

`mariadb -e  "show variables like 'character%';"
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+`

and:

`mariadb -e  "show variables like 'collation%';"
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_unicode_ci |
| collation_database   | utf8mb4_unicode_ci |
| collation_server     | utf8mb4_unicode_ci |
+----------------------+--------------------+`

here are my conf files:

/etc/mysql/mariadb.conf.d/50-server.cnf contains:
[...]
`character-set-server  = utf8mb4
collation-server      = utf8mb4_unicode_ci
init-connect='SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci'
skip-character-set-client-handshake
[...]`

/etc/mysql/mariadb.conf.d/50-client.cnf contains:
`[...]
default-character-set = utf8mb4
init_command = "SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci"
[...]`

I understand that data are ok in the base, but I can't select them correctly. What am I doing wrong? What is the way to select correctly my chinese characters? Thank you in advance


Solution

  • In unicode_ci collation these characters might have the same weight (or are defined as ignorable), so you should use (less restrictive) general_ci collation instead:

    # set unicode variables
    set @a=cast(0xE2BA8C as char character set utf8mb4); 
    set @b=cast(0xE2BA8D as char character set utf8mb4);  
    
    select @a,@b;
    +------+------+
    | @a   | @b   |
    +------+------+
    | ⺌   | ⺍   |
    +------+------+
    
    
    # compare @a with @b, using different collations
    
    select @a=@b collate utf8mb4_unicode_ci;
    +----------------------------------+
    | @a=@b collate utf8mb4_unicode_ci |
    +----------------------------------+
    |                                1 |
    +----------------------------------+
    
    MariaDB []> select @a=@b collate utf8mb4_general_ci;
    +----------------------------------+
    | @a=@b collate utf8mb4_general_ci |
    +----------------------------------+
    |                                0 |
    +----------------------------------+