I want to use a API from a game and store the player and clan names in a local database. The names can contain all sorts of characters and emoticons. Here are just a few examples I found:
I use python for reading the api and write it into a mysql database. After that, I want to use the names on a Node.js web application.
What is the best way to encode those characters and how can I savely store them in the database, so that I can display them correcly afterwards?
I tried to encode the strings in python with utf-8:
>>> sample = '蛙喜鄉民CLUB'
>>> sample
'蛙喜鄉民CLUB'
>>> sample = sample.encode('UTF-8')
>>> sample
b'\xe8\x9b\x99\xe5\x96\x9c\xe9\x84\x89\xe6\xb0\x91CLUB'
and storing the encoded string in a mysql database with utf8mb4_unicode_ci character set.
When I store the string from above and select it inside mysql workbench it is displayed like this:
蛙喜鄉民CLUB
When I read this string from the database again in python (and store it in db_str
) I get:
>>> db_str
èåéæ°CLUB
>>> db_str.encode('UTF-8')
b'\xc3\xa8\xc2\x9b\xc2\x99\xc3\xa5\xc2\x96\xc2\x9c\xc3\xa9\xc2\x84\xc2\x89\xc3\xa6\xc2\xb0\xc2\x91CLUB'
The first output is total gibberish, the second one with utf-8 looks mostly like the encoded string from above, but with added \xc2
or \xc3
between each byte.
How can I save such strings into mysql, so that I can read them again and display them correctly inside a python script?
Is my database collation utf8mb4_unicode_ci not suitable for such content? Or do I have to use another encoding?
As described by @abarnert in a comment to the question, the problem was that the library used for written the unicode strings didn't know that utf-8 should be used and therefor encoded the strings wrong.
After adding charset='utf8mb4'
as parameter to the mysql connection the string get written correctly in the intended encoding.
All I had to change was
conn = MySQLdb.connect(host, user, pass, db, port)
to
conn = MySQLdb.connect(host, user, pass, db, port, charset='utf8mb4')
and after that my approach described in the question worked flawlessly.
edit: after declaring the charset='utf8mb4'
parameter on the connection object it is no longer necessary to encode the strings, as that gets now already successfully done by the mysqlclient library.