Search code examples
mysqlstringcollationcharacter-setstring-function

MySQL CHAR_LENGTH(str) sometimes produces different outputs for the same String literal when used with different character set introducers


This is weird. As per MySQL developer website -

MySQL CHAR_LENGTH(str)

Returns the length of the string str, measured in characters. A multibyte character counts as a single character. This means that for a string containing five 2-byte characters, LENGTH() returns 10, whereas CHAR_LENGTH() returns 5.

It clearly means that the output of CHAR_LENGTH(str) is independent of character set.

Now as given in -

Character String Literal Character Set and Collation

I can use introducer to SET CHARACTER SET of a String, as in -

SELECT 'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci;

In my case, I take the String literal "Hello", use introducer to set it's CHARACTER SET, and use it as argument to the MySQL CHAR_LENGTH(str) function. But weirdly enough, it sometimes produces different outputs when different CHARACTER SETS are used. Example -

SELECT CHAR_LENGTH(_utf8mb4"Hello") AS character_length;
+------------------+
| character_length |
+------------------+
|                5 |
+------------------+

SELECT CHAR_LENGTH(_latin1"Hello") AS character_length;
+------------------+
| character_length |
+------------------+
|                5 |
+------------------+

SELECT CHAR_LENGTH(_ucs2"Hello") AS character_length;
+------------------+
| character_length |
+------------------+
|                3 |
+------------------+

Here, "Hello" has 5 characters. The CHARACTER SETS _utf8mb4 and _latin1 displays the number of characters correctly. But the CHARACTER SET _ucs2 oddly displays the number 0f characters to be 3.

How is this happening?


Solution

  • That is because the language displyed, need 3 charcaters for that word.

    To extend my answer.

    A character conversion is in best cases difficult and should be avoided at all costs.

    First MySQL tries to convert the $ Byte characte4r into 2 Byte characters that ucs2 has.

    The resulting bytes are then used to display the characters, what you see at the end of the Snippet.

    So in a conversion of character sets you don't have letters you have like everywhere in computer science bytes, which with the Representation of lets say H, but the bytes could in another character set have another representation. Further you need always some rules, how you can convert one character set to another, if the number of bytes differ.

    So i also extended the example, to show you that a deterministic convertion actually happen, if you look at the bytes or binary representation you will find the concrete alghorithm,that is used.

    SELECT CHAR_LENGTH(_ucs2"Hello") AS character_length;
    
    | character_length |
    | ---------------: |
    |                3 |
    
    SELECT _ucs2"Hello";
    
    | 䡥汬  |
    | :------ |
    | H敬汯 |
    
    SELECT _ucs2"Hel";
    
    | 䡥  |
    | :--- |
    | H敬 |
    
    SELECT _ucs2"Hell";
    
    | 䡥汬 |
    | :----- |
    | 䡥汬 |
    
    SELECT _ucs2"Hellos";
    
    | 䡥汬潳 |
    | :-------- |
    | 䡥汬潳 |
    

    db<>fiddle here