Search code examples
pythonmysqlcharacter-encodingutf8mb4

How to encode international strings with emoticons and special characters for storing in database


I want to use a API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found:

  • ⭐💎
  • яαℓαηι
  • نکل
  • 窝猫
  • 鐵擊道遊隊
  • ❤✖❤♠️♦️♣️✖

I use python for reading the api and write it into a mysql database. After that, I want to use the names on a Node.js web application.

What is the best way to encode those characters and how can I savely store them in the database, so that I can display them correcly afterwards?

I tried to encode the strings in python with utf-8:

>>> sample = '蛙喜鄉民CLUB'
>>> sample
'蛙喜鄉民CLUB'
>>> sample = sample.encode('UTF-8')
>>> sample
b'\xe8\x9b\x99\xe5\x96\x9c\xe9\x84\x89\xe6\xb0\x91CLUB'

and storing the encoded string in a mysql database with utf8mb4_unicode_ci character set.

When I store the string from above and select it inside mysql workbench it is displayed like this:

蛙喜鄉民CLUB

When I read this string from the database again in python (and store it in db_str) I get:

>>> db_str
èåéæ°CLUB
>>> db_str.encode('UTF-8')
b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'

The first output is total gibberish, the second one with utf-8 looks mostly like the encoded string from above, but with added \xc2 or \xc3 between each byte.

How can I save such strings into mysql, so that I can read them again and display them correctly inside a python script?

Is my database collation utf8mb4_unicode_ci not suitable for such content? Or do I have to use another encoding?


Solution

  • As described by @abarnert in a comment to the question, the problem was that the library used for written the unicode strings didn't know that utf-8 should be used and therefor encoded the strings wrong.

    After adding charset='utf8mb4' as parameter to the mysql connection the string get written correctly in the intended encoding.

    All I had to change was

    conn = MySQLdb.connect(host, user, pass, db, port)
    

    to

    conn = MySQLdb.connect(host, user, pass, db, port, charset='utf8mb4')
    

    and after that my approach described in the question worked flawlessly.

    edit: after declaring the charset='utf8mb4' parameter on the connection object it is no longer necessary to encode the strings, as that gets now already successfully done by the mysqlclient library.