mysql python-unicode mysql-connector-python

MySQL outputs non-latin unicode as 三

I'm working with Internet Speculative Fiction Database downloadable MySQL database. This query:

SELECT title_title
FROM titles
WHERE title_id = 1779018;

returns 三体. It should return "三体", the Chinese characters for title of the book, The Three-Body Problem. The output is similar for all non-latin unicode text, and some extended latin characters appear as squares. This happens both in the MySQL terminal, and when I use the python MySQL Connector:

import mysql.connector

db_params = dict(
    host="localhost",
    user="root",
    password="",
    database="isfdb"
)
conn = mysql.connector.connect(**db_params)
cur = conn.cursor()

cur.execute("""
    SELECT title_title
    FROM titles
    WHERE title_id = 1779018;
    """
)
unicode_book_title = cur.fetchone()[0]
conn.close()
print(unicode_book_title)

I've tried making changes to the character_set* variables in terminal, but I can't find a combination that works in this case. The current settings are:

mysql> show variables like '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

My ultimate goal is to use the python connector for this, so a solution that explicitly sets the character set in the script and doesn't depend on system settings is preferable.

The ISFDB page for this record displays the characters correctly, which makes me think the data was stored correctly, but something is going wrong with the retrieval:

http://www.isfdb.org/cgi-bin/title.cgi?1779018

This can be reproduced by downloading the database from:

http://www.isfdb.org/wiki/index.php/ISFDB_Downloads

I'm using the 5.5-compatible 2021-04-24 download. The setup instructions are here:

http://www.isfdb.org/wiki/index.php/ISFDB:MySQL_Only_Setup

System Information:

mysql Ver 14.14 Distrib 5.7.33, for Linux (x86_64) using EditLine wrapper
mysql.connector.__version__ is 2.2.2b1
Python 3.8.9
Ubuntu 16.04.7 LTS (xenial)

Solution

After more research, I figured out this database doesn't actually store these characters with a unicode encoding, it uses latin encoding and the character's "numeric character reference". In Python 3, you can convert to unicode like:

from html import unescape

print(unescape(unicode_book_title))

This returns "三体". The regular latin characters are left the same for other titles. In Python 2, the HTMLParser package can be used. See: Convert numeric character reference notation to unicode string

MySQL outputs non-latin unicode as &#19977;

MySQL outputs non-latin unicode as 三