
How can I search by emoji in MySQL using utf8mb4?


Please help me understand how multibyte characters such as emoji are handled in MySQL utf8mb4 fields.

See below for a simple test SQL to illustrate the challenges.

/* Clear Previous Test */
DROP TABLE IF EXISTS `emoji_test`;
DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

/* Build Schema */
CREATE TABLE `emoji_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

/* INSERT data */
# Expected Result is successful insert for each of these.
# However some fail. See comments.
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮🌶', 1);                 # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶🌮', 1);                 # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶', 1);   # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮', 1);   # FAIL: Duplicate entry '?-1' for key 'idx_string_status'
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮🌶', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶🌮', 1); # FAIL: Duplicate entry '??-1' for key 'idx_string_status'

/* Test data */

    /* Simple Table */
SELECT * FROM emoji_test WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # SUCCESS (all 4 are found)
SELECT * FROM emoji_test WHERE `string` IN ('🌶');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test WHERE `string` IN ('🌮');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test;                                              # SUCCESS (all 4 are found)

    /* Table with Unique Key */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # FAIL: Only 2 are found (due to insert errors above)
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶');                     # SUCCESS
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌮');                     # FAIL: 🌶 found instead of 🌮
SELECT * FROM emoji_test_with_unique_key;                                              # FAIL: Only 2 records found (🌶 and 🌮🌶)

I'm interested in learning what causes the FAILs above and how I can get around this.

Specifically:

  1. Why do selects for one multibyte character return results for any multibyte character?
  2. How can I configure an index to handle multibyte characters instead of ??
  3. Can you recommend changes to the second CREATE TABLE (the one with a unique key) above so that all the test queries return successfully?

Solution

  • You use utf8mb4_unicode_ci for your columns, so comparisons are case insensitive and rely on that collation's weights. If you use utf8mb4_bin instead, the emoji 🌮 and 🌶 are correctly identified as different characters.

    With WEIGHT_STRING you can get the values that are used for sorting and comparison for the input string.

    If you write:

    SELECT
      WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_ci'),
      WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_ci')
    

    Then you can see that both are 0xfffd. The "Unicode Character Sets" section of the MySQL manual says:

    For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER.
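
    To see the consequence of the shared weight directly, the two emoji can be compared under utf8mb4_unicode_ci. This is a minimal sketch, assuming the client connection uses utf8mb4; the SET NAMES line and the column aliases are additions for illustration, not part of the original test:

```sql
/* Assumption: make sure the connection sends and receives utf8mb4 */
SET NAMES utf8mb4;

SELECT
  HEX(WEIGHT_STRING('🌮' COLLATE utf8mb4_unicode_ci)) AS taco_weight,
  HEX(WEIGHT_STRING('🌶' COLLATE utf8mb4_unicode_ci)) AS pepper_weight,
  /* COLLATE binds tighter than =, so the comparison runs under utf8mb4_unicode_ci */
  '🌮' COLLATE utf8mb4_unicode_ci = '🌶' AS compared_equal;
```

    Because both weights are 0xfffd, the comparison should report the two emoji as equal, which is what makes the unique key reject the second insert and the single-emoji SELECT return both rows.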

    If you write:

    SELECT 
      WEIGHT_STRING('🌮' COLLATE 'utf8mb4_bin'),
      WEIGHT_STRING('🌶' COLLATE 'utf8mb4_bin')
    

    You will get their Unicode code points, 0x01f32e and 0x01f336, instead.
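
    A binary comparison can also be requested per query, without altering the table, by attaching COLLATE utf8mb4_bin to the column. The following is a sketch against the emoji_test table from the question, assuming a utf8mb4 connection. Overriding the collation this way makes the match case sensitive and will generally keep MySQL from using an index on the column for that comparison:

```sql
/* Assumption: the connection character set is utf8mb4 */
SET NAMES utf8mb4;

/* Compare by code point instead of by utf8mb4_unicode_ci weight */
SELECT * FROM emoji_test WHERE `string` COLLATE utf8mb4_bin IN ('🌶');
SELECT * FROM emoji_test WHERE `string` COLLATE utf8mb4_bin IN ('🌮');
```

    Each query should then match only the row holding that exact emoji.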

    Other letters such as Ä, Á and A compare as equal under utf8mb4_unicode_ci, which can be seen in:

    SELECT
      WEIGHT_STRING('Ä' COLLATE 'utf8mb4_unicode_ci'),
      WEIGHT_STRING ('A' COLLATE 'utf8mb4_unicode_ci')
    

    Both map to the weight 0x0E33:

    Ä: 00C4  ; [.0E33.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS; QQCM
    A: 0041  ; [.0E33.0020.0008.0041] # LATIN CAPITAL LETTER A
    

    According to "Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?", the weights used for utf8mb4_unicode_ci are based on UCA 4.0.0. Because the emoji do not appear there, they are mapped to the weight 0xfffd.

    If you need case-insensitive comparison and sorting for regular letters along with emoji, the problem is solved by using utf8mb4_unicode_520_ci:

    SELECT
      WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_520_ci'),
      WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_520_ci')
    

    This also yields different weights for those emoji: 0xfbc3f32e and 0xfbc3f336.
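
    Applied to the second table from the question, the change could look like the sketch below. It assumes your server provides utf8mb4_unicode_520_ci and that the connection uses utf8mb4; only the column collation differs from the original CREATE TABLE. If case-sensitive matching is acceptable, utf8mb4_bin could be substituted in the same place.

```sql
/* Assumption: the connection character set is utf8mb4 */
SET NAMES utf8mb4;

DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

/* Same schema as in the question, but the column uses utf8mb4_unicode_520_ci */
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

/* The emoji now carry distinct weights, so none of these should collide */
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶', 1);
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮', 1);
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮🌶', 1);
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶🌮', 1);

/* Re-run the single-emoji lookups from the question */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶');
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌮');
```

    An existing table could be converted in place with ALTER TABLE ... MODIFY on the `string` column instead of dropping it. That conversion would fail if rows that are now considered distinct had already been rejected or merged, so check the data first.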