
What's the difference between utf8_unicode_ci and utf8mb4_0900_ai_ci


What is the difference between the utf8mb4_0900_ai_ci and utf8_unicode_ci collations in MySQL, especially in terms of performance?

Update:

Are there similar differences between utf8mb4_unicode_ci and utf8mb4_0900_ai_ci?


Solution

    • The encoding is the same. That is, for characters that both character sets support, the bytes look the same.
    • The character set is different. utf8mb4 has more characters.
    • The collation (how comparisons are done) is different.
    • The performance is different, but it rarely matters.

    utf8_unicode_ci implies the CHARACTER SET utf8, which includes only the 1-, 2-, and 3-byte UTF-8 characters. Hence it excludes most Emoji and some Chinese characters.

    utf8mb4_unicode_ci is the corresponding COLLATION for the 4-byte CHARACTER SET utf8mb4, which covers all of UTF-8, including the 4-byte characters.
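
    As a rough illustration of the character-set difference, the sketch below creates a table with the 3-byte character set, converts it to utf8mb4, and then stores an Emoji. The table name legacy_notes and its column are hypothetical, and the statements assume MySQL 8.0 and a client that sends UTF-8.

    ```sql
    -- Hypothetical table that starts out with the 3-byte character set.
    -- (MySQL 8.0 accepts "utf8" here but warns that it is deprecated.)
    CREATE TABLE legacy_notes (
      body VARCHAR(200)
    ) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

    -- Convert the table, and every text column in it, to the 4-byte character set.
    ALTER TABLE legacy_notes
      CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

    -- Make the connection speak utf8mb4 as well, then store an Emoji.
    SET NAMES utf8mb4;
    INSERT INTO legacy_notes (body) VALUES ('thumbs up 👍');

    -- CHAR_LENGTH counts characters, LENGTH counts bytes;
    -- the Emoji is 1 character but 4 bytes.
    SELECT body, CHAR_LENGTH(body) AS chars, LENGTH(body) AS bytes
    FROM legacy_notes;
    ```

    Before the conversion, the same INSERT would be expected to fail or lose the Emoji, depending on the SQL mode.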

    The Unicode organization has been evolving the specification over the years. Here are the mappings from its "versions" to MySQL Collations:

    4.0   _unicode_
    5.2.0 _unicode_520_ (Unicode 2009; MySQL GA 5.6 2013)
    9.0   _0900_
    14.0  _uca1400_ai_ci etc.  as/ai and cs/ci (MariaDB-10.10, not MySQL)
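
    To see which of these versions a particular server actually offers, listing the collations is a reasonable first step. This is only a sketch; the rows returned depend on the server version.

    ```sql
    -- All collations available for the utf8mb4 character set.
    -- The middle part of each name (unicode, unicode_520, 0900, ...)
    -- corresponds to the Unicode version in the list above.
    SHOW COLLATION WHERE Charset = 'utf8mb4';

    -- Only the collations based on Unicode 9.0 (MySQL 8.0 and later).
    SHOW COLLATION LIKE 'utf8mb4%0900%';
    ```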
    

    Most of the differences will be in areas that most people never encounter. One example: At some point, a change allowed Emoji to be distinguished and ordered in some manner.

    The suffixes (see the MySQL documentation):

    _bin      -- just compare the bits; don't consider case folding, accents, etc
    _ci       -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
    _ai_ci    -- explicitly case insensitive and accent insensitive
    _as (etc) -- accent-sensitive (etc)
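
    The suffixes can be tried out directly by comparing string literals under an explicit collation. A minimal sketch, assuming MySQL 8.0 (for the _0900_ collations) and a client that sends UTF-8; the column aliases are only labels:

    ```sql
    -- Make the literals below utf8mb4 so the COLLATE clauses are valid.
    SET NAMES utf8mb4;

    SELECT
      'a' = 'A' COLLATE utf8mb4_0900_ai_ci AS ai_ci_case,    -- 1: case ignored
      'a' = 'á' COLLATE utf8mb4_0900_ai_ci AS ai_ci_accent,  -- 1: accent ignored
      'a' = 'A' COLLATE utf8mb4_0900_as_cs AS as_cs_case,    -- 0: case matters
      'a' = 'á' COLLATE utf8mb4_0900_as_cs AS as_cs_accent,  -- 0: accent matters
      'a' = 'A' COLLATE utf8mb4_bin        AS bin_case;      -- 0: bytes differ
    ```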
    

    Performance:

    _bin         -- simple, fast
    _general_ci  -- compares one character at a time, so it cannot treat multi-letter equivalents as equal (eg, ss=ß fails); hence somewhat fast
    ...          -- slower
    _0900_       -- (8.0) much faster because of a rewrite
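
    The ss=ß case mentioned for _general_ci can be checked with a query along these lines. It is a sketch that assumes MySQL 8.0 and a UTF-8 client; it shows the behavioral difference only and measures nothing about speed.

    ```sql
    SET NAMES utf8mb4;

    SELECT
      -- _general_ci compares character by character, so this should be 0.
      'ss' = 'ß' COLLATE utf8mb4_general_ci AS general_ci,
      -- The Unicode-based collations handle the expansion, so these should be 1.
      'ss' = 'ß' COLLATE utf8mb4_unicode_ci AS unicode_ci,
      'ss' = 'ß' COLLATE utf8mb4_0900_ai_ci AS uca_0900;
    ```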
    

    However: The speed of collation is usually the least of the performance issues in queries. INDEXes, JOINs, subqueries, table scans, etc are much more critical to performance.