Search code examples
mysqlutf-8viewddlcharacter-set

MYSQL: Strings in DDL of VIEW turned into garbled symbols several hours later


I recently modified the DDL of a VIEW with Workbench by adding a filter with strings of Chinese characters. The DDL can be saved successfully and worked perfectly right after being modified. However, the modification itself in the DDL became garbled symbols several hours later. The character_set_database and the character set of some columns were originally utf8(utf8_general_ci). After getting the error, I have made all of them utf8mb4 with default collation (utf8mb4_0900_ai_ci). The garbled symbols did change after altering the character set but they are still garbled. Any idea?

Example:
DDL right after modification: where m.NAME not in ('王曉明','張小英')
DDL several hours after modification: where m.NAME not in ('???D?','??\?')

Environment:
MYSQL 8.0.13 Community Server - GPL
Windows 10 professional 64bit (Tradition Chinese; Character set of typing output: UNICODE)
Workbench 8.0.13

Show variables like '%char%'
Result:
character_set_client utf8mb4
character_set_connection utf8mb4
character_set_database utf8mb4
character_set_filesystem binary
character_set_results utf8mb4
character_set_server utf8mb4
character_set_system utf8
character_sets_dir C:\Program Files\MySQL\MySQL Server 8.0\share\charsets\


Solution

  • Please provide a reference to Windows UNICODE -- we need to determine whether it is really "Unicode code points" or actually "UTF-8". If you can provide a hex dump of a little bit of text, I can deduce the answer from that.

    More specifically, 王曉明張小英, encoded in UTF-8 (MySQL's utf8 or utf8mb4) is hex

    E78E8B E69B89 E6988E E5BCB5 E5B08F E88BB1
    

    (Spaces added to separate characters.) For Unicode (MySQL's UCS2):

    738B 66C9 660E 5F35 5C0F 82F1
    

    So, if you get the second hex, then you need to declare that the client is using ucs2, not utf8mb4. Meanwhile, it is quite reasonable for the columns in your tables to be utf8mb4. (And I recommend such.)

    "Character set" versus "collation": utf8mb4 is a "character set"; it determines the "encoding" if the bytes. utf8mb4_0900_ai_ci is a "collation"; it determines the sorting order of characters. You have an encoding problem, not a sorting problem.

    "several hours after modification" -- This reminds me of the derivation of the computer term "bug". Most of the original computers were built of vacuum tubes. Moths were attracted to the lights emanating for the tubes. They sometimes caused hardware problems.

    Hex A4FDBEE5A9FA is the Big5 encoding for 王曉明.