Search code examples
mysqlsqldatabasecollation

What COLLATE should i set to use all kind of possible languages?


I have a column called username, I want the user to be able to insert text in japanese, roman, arabic, korean, and everything that is possible, including special chars [https://en.wiktionary.org/wiki/Index:All_languages], what COLLATE should I set on my database and tables?

I'm using utf_general_ci, I'm new so I don't know if this is the best COLLATE for my needs. I need to choose the right COLLATE to avoid sql error, because I will not use preg_replace or a function to replace special chars, I will only use prepared statement to avoid SQL injection and protect by database.


Solution

    • First choice (MySQL 8.0): utf8mb4_0900_ai_ci
    • Second choice (as of 5.6): utf8mb4_unicode_520_ci
    • Third choice (5.5+): utf8mb4_unicode_ci
    • Before 5.5, you can't handle all of Chinese, nor Emoji: utf8_unicode_ci

    The numbers refer to Unicode standards 9.0, 5.20, and (no number) 4.0.

    No collation is good for sorting all languages at the same time. Spanish, German, Turkish, etc, have quirks that are incompatible. The collations above are the 'best' general purpose ones available.

    utf8mb4 handles all characters yet specified by Unicode (including Cherokee, Klingon, Cuneiform, Byzantine, etc.)

    If Portuguese is the focus:

    See https://pt.stackoverflow.com/ and MySQL collation for Portugese .

    Study this for 8.0 or this for pre 8.0 to see which utf8/utf8mb4 collation comes closest to sorting Portuguese 'correctly'. Perhaps utf8mb4_danish_ci or utf8mb4_de_pb_0900_ai_ci would be best.

    (Else go with the 'choices' listed above.)