I'm working with Internet Speculative Fiction Database downloadable MySQL database. This query:
SELECT title_title
FROM titles
WHERE title_id = 1779018;
returns 三体
. It should return "三体", the Chinese characters for title of the book, The Three-Body Problem. The output is similar for all non-latin unicode text, and some extended latin characters appear as squares. This happens both in the MySQL terminal, and when I use the python MySQL Connector:
import mysql.connector
db_params = dict(
host="localhost",
user="root",
password="",
database="isfdb"
)
conn = mysql.connector.connect(**db_params)
cur = conn.cursor()
cur.execute("""
SELECT title_title
FROM titles
WHERE title_id = 1779018;
"""
)
unicode_book_title = cur.fetchone()[0]
conn.close()
print(unicode_book_title)
I've tried making changes to the character_set*
variables in terminal, but I can't find a combination that works in this case. The current settings are:
mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
My ultimate goal is to use the python connector for this, so a solution that explicitly sets the character set in the script and doesn't depend on system settings is preferable.
The ISFDB page for this record displays the characters correctly, which makes me think the data was stored correctly, but something is going wrong with the retrieval:
http://www.isfdb.org/cgi-bin/title.cgi?1779018
This can be reproduced by downloading the database from:
http://www.isfdb.org/wiki/index.php/ISFDB_Downloads
I'm using the 5.5-compatible 2021-04-24 download. The setup instructions are here:
http://www.isfdb.org/wiki/index.php/ISFDB:MySQL_Only_Setup
System Information:
mysql.connector.__version__
is 2.2.2b1After more research, I figured out this database doesn't actually store these characters with a unicode encoding, it uses latin encoding and the character's "numeric character reference". In Python 3, you can convert to unicode like:
from html import unescape
print(unescape(unicode_book_title))
This returns "三体". The regular latin characters are left the same for other titles. In Python 2, the HTMLParser package can be used. See: Convert numeric character reference notation to unicode string