Tags: python, mysql, utf-8, python-3.5, utf8mb4

How to avoid b' and UTF-8 literals in MySQL using Python 3


I'm a bit of a novice. I'm trying to parse HTML pages and store their contents in a MySQL column, but I can't get the actual foreign characters to appear: for example, instead of á I get xc3xa1. My table uses the utf8mb4 character set with the utf8mb4_unicode_ci collation. I have the following setup:

Database_cnx = pymysql.connect(user='XXXX', password='XXXX',
                              host='XXXX',
                              database='XXXX',
                              use_unicode=True,
                              charset='utf8mb4')

article_content = str(row[3].encode("utf-8")).replace("'", "\'").replace("\"", "\'")

q_i = ("INSERT INTO article_items (" + ", ".join(article_table_col_name_new) + ")"
"VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)
)

So how can I get only the actual article content into my column, without the b' bytes prefix and the UTF-8 escape literals? Thanks.


Solution

  • The problem is that you're explicitly encoding your string into UTF-8 bytes, and then turning that bytes object into its string representation.

    That's what this code means:

    str(row[3].encode("utf-8"))
    

    If you don't want to do that, just don't do that:

    row[3]
    

    Here's an example that shows what you're doing:

    >>> s = 'à'
    >>> s
    'à'
    >>> s.encode('utf-8')
    b'\xc3\xa0'
    >>> str(s.encode('utf-8'))
    "b'\\xc3\\xa0'"
    

    What you want here is the first one.

    More generally, calling str on a bytes is almost never useful. If you unavoidably have a bytes and you need a str, you get it by calling the decode method. But in this case, you don't unavoidably have a bytes. (I mean, you could write row[3].encode("utf-8").decode("utf-8"), but that would obviously be pretty silly.)
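To make the difference between decode and str concrete, here is a minimal sketch that round-trips the same character as the example above. The variable names are only illustrative.

```python
# Round trip between str and bytes.
s = 'à'
raw = s.encode('utf-8')          # bytes: b'\xc3\xa0'

decoded = raw.decode('utf-8')    # back to the original one-character text
stringified = str(raw)           # the repr of the bytes object, not the text

print(decoded == s)              # True
print(len(decoded), len(stringified))
print(stringified)               # b'\xc3\xa0'
```

The stringified value is what ended up in the article_content variable in the question: a str that merely describes a bytes object.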


    As a side note, but a very important one: you should not be trying to str.format your values into the SQL string. Just use query parameters. The obligatory xkcd comic ("Exploits of a Mom", better known as Bobby Tables) explains the security/safety problem, and on top of that, formatting the values yourself makes your code much more complicated and even less efficient.

    In other words, instead of doing this:

    "VALUES ({:d}, \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\", \"{:s}\")".format(row[0], urlparse(row[1]).netloc, row[1], row[2].replace("\"", "'"), article_content, datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated)
    

    … just do this:

    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
    

    Then, when you later execute the query, pass the values as-is as the second argument to execute, without all that complicated converting to strings, quoting, and replacing embedded quotes.

    # db is assumed to be a cursor, e.g. db = Database_cnx.cursor()
    db.execute(q_i, (
        row[0], urlparse(row[1]).netloc, row[1], row[2], article_content,
        datetime.fromtimestamp(row[4]).strftime("%Y-%m-%d"), updated))
    

    In fact, if your next to last column is—or could be—a DATETIME column rather than a CHAR/VARCHAR/TEXT/whatever, you don't even need that strftime; just pass the datetime object.

    And notice that this means that you don't need to do anything at all to article_content. The quote stuff is neither necessary nor a good idea (unless you have some other, app-specific reason that you need to avoid " characters in articles), and the encoding stuff is not solving any problem, but only causing a new one.
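Putting the pieces together, the insert might look roughly like the sketch below. The helper name build_insert, the column names, and the sample row are hypothetical and only stand in for the asker's article_table_col_name_new and row; the actual execute call is left as a comment because it needs a live MySQL connection.

```python
from datetime import datetime
from urllib.parse import urlparse


def build_insert(columns, row, updated):
    """Return (sql, params) for a parameterized INSERT into article_items."""
    # Column names cannot be sent as query parameters, so they are joined
    # into the SQL text; they must come from trusted code, never user input.
    placeholders = ", ".join(["%s"] * len(columns))
    sql = "INSERT INTO article_items ({}) VALUES ({})".format(
        ", ".join(columns), placeholders)
    params = (
        row[0],
        urlparse(row[1]).netloc,
        row[1],
        row[2],                          # title, quotes left untouched
        row[3],                          # article text as a plain str
        datetime.fromtimestamp(row[4]),  # datetime object, no strftime
        updated,
    )
    return sql, params


# Hypothetical column names and sample data, for illustration only.
columns = ["id", "domain", "url", "title", "content", "published", "updated"]
row = (1, "https://example.com/post", 'It\'s a "title"', "á é ñ", 1500000000)
sql, params = build_insert(columns, row, "2017-07-14")
print(sql)

# With a real connection (assumed to be the asker's Database_cnx):
# with Database_cnx.cursor() as cursor:
#     cursor.execute(sql, params)
# Database_cnx.commit()
```

The driver does the quoting and encoding of every value, so the text in row[3] reaches the utf8mb4 column unchanged.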