Tags: python, python-2.7, unicode, character-encoding, non-ascii-characters

Reliable way of handling non-ASCII characters in Python?


I have a column in a spreadsheet whose header contains non-ASCII characters, thus:

'ï»¿Campaign'

If I pop this string into the interpreter, I get:

'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'

The string is one of the keys in the rows of a csv.DictReader().

When I try to populate a new dict with the value of this key:

spends['ï»¿Campaign'] = 2

I get:

KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'

If I print the keys of row, I can see that the key is actually '\xef\xbb\xbfCampaign'.

Obviously, then, I can just update my program to access this key thus:

spends['\xef\xbb\xbfCampaign']

But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any and all non-ASCII characters that may arise?


Solution

  • In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. And, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs such as io.open() can do it implicitly so that your code sees only Unicode.
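A minimal sketch of the decode-on-input idea, using a throwaway file (the filename and contents here are made up for illustration). The 'utf-8-sig' codec swallows a leading UTF-8 BOM if one is present, and behaves like plain 'utf-8' if it is absent, so the header key comes out clean:

```python
import io
import os
import tempfile

# Create a throwaway CSV file with a UTF-8 BOM, mimicking a spreadsheet
# export (hypothetical data, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "spends.csv")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\ufeffCampaign,Spend\nSpring,2\n")

# Decode on input: 'utf-8-sig' strips the BOM, so the first header cell
# is plain u'Campaign' rather than u'\ufeffCampaign'.
with io.open(path, "r", encoding="utf-8-sig") as f:
    header = f.readline().strip().split(u",")

print(header)
```

Because io.open() returns Unicode text on both Python 2 and Python 3, the rest of the program never sees the raw bytes at all.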

    Unfortunately, the csv module does not support Unicode directly on Python 2. See the UnicodeReader and UnicodeWriter examples in the docs. You could create their analog for csv.DictReader, or, as an alternative, just pass utf-8 encoded bytestrings to the csv module.
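    One sketch of that idea, with the sample bytes invented for illustration: decode the raw data with 'utf-8-sig' before it reaches csv.DictReader, so the BOM never becomes part of a key. This runs as-is on Python 3; on Python 2, where the csv module wants bytestrings, you would instead parse the UTF-8 bytes first and decode each field (and key) afterwards.

    ```python
    import csv
    import io

    # Illustrative CSV bytes as an exported spreadsheet might produce them:
    # UTF-8 encoded, with a BOM glued onto the first header cell.
    raw = b"\xef\xbb\xbfCampaign,Spend\r\nSpring,2\r\n"

    # Decode as early as possible; 'utf-8-sig' strips the BOM if present.
    text = raw.decode("utf-8-sig")

    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    print(rows[0]["Campaign"])
    ```

    With the BOM stripped up front, the key is plain "Campaign" and ordinary lookups work without hard-coding '\xef\xbb\xbf' into the program.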