Search code examples
pythonmysqlunicode

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'


I'm getting this error UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'

I'm trying to load lots of news articles into a MySQLdb. However I'm having difficulty handling non-standard characters, I get hundreds of these errors for all sorts of characters. I can handle them individually using .replace() although I would like a more complete solution to handle them correctly.

ubuntu@ip-10-0-0-21:~/scripts/work$ python test_db_load_error.py
Traceback (most recent call last):
  File "test_db_load_error.py", line 27, in <module>
    cursor.execute(sql_load)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 157, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 158: ordinal not in range(256)

My script;

import MySQLdb as mdb
from goose import Goose
import string
import datetime

host = 'rds.amazonaws.com'
user = 'news'
password = 'xxxxxxx'
db_name = 'news_reader'
conn = mdb.connect(host, user, password, db_name)

url = 'http://www.dailymail.co.uk/wires/ap/article-3060183/Andrew-Lesnie-Lord-Rings-cinematographer-dies.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490'
g = Goose()
article = g.extract(url=url)
body = article.cleaned_text
body = body.replace("'","`")
load_date = str(datetime.datetime.now())
summary = article.meta_description
title = article.title
image = article.top_image

sql_load = "insert into articles " \
        "    (title,summary,article,,image,source,load_date) " \
        "     values ('%s','%s','%s','%s','%s','%s');" % \
        (title,summary,body,image,url,load_date)
cursor = conn.cursor()
cursor.execute(sql_load)
#conn.commit()

Any help would be appreciated.


Solution

  • If your database is actually configured for Latin-1, then you cannot store non-Latin-1 characters in it. That includes U+2014, EM DASH.

    The ideal solution is to just switch to a database configured for UTF-8. Just pass charset='utf-8' when initially creating the database, and every time you connect to it. (If you already have existing data, you probably want to use MySQL tools to migrate the old database to a new one, instead of Python code, but the basic idea is the same.)

    However, sometimes that isn't possible. Maybe you have other software that can't be updated, requires Latin-1, and needs to share the same database. Or maybe you've mixed Latin-1 text and binary data in ways that can't be programmatically unmixed, or your database is just too huge to migrate, or whatever. In that case, you have two choices:

    • Destructively convert your strings to Latin-1 before storing and searching. For example, you might want to convert an em dash to -, or to --, or maybe it's not all that important and you can just convert all non-Latin-1 characters to ? (which is faster and simpler).

    • Come up with an encoding scheme to smuggle non-Latin-1 characters into the database. This means some searches become more complicated, or just can't be done directly in the database.