I have a column in a spreadsheet whose header contains non-ASCII characters, thus:
'ï»¿Campaign'
If I pop this string into the interpreter, I get:
'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
The string is one of the keys in the rows of a csv.DictReader().
When I try to populate a new dict with the value of this key:
spends['ï»¿Campaign'] = 2
I get:
KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
If I print the value of the keys of row, I can see that it is '\xef\xbb\xbfCampaign'
Obviously then I can just update my program to access this key thus:
spends['\xef\xbb\xbfCampaign']
But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any and all non-ASCII characters that may arise?
In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input and, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs, such as io.open(), can do this implicitly so that your code sees only Unicode.
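As a minimal sketch of that rule: the bytes \xef\xbb\xbf in front of Campaign are the UTF-8 byte order mark (BOM), so the example below assumes the file is UTF-8 with a BOM and opens it with the utf-8-sig codec, which decodes the text and drops the BOM. The temporary file and its contents are made up for illustration.

```python
import io
import os
import tempfile

# Made-up sample data: a UTF-8 file that starts with a BOM (\xef\xbb\xbf).
fd, path = tempfile.mkstemp(suffix=".csv")
os.write(fd, b"\xef\xbb\xbfCampaign,Spend\r\nCaf\xc3\xa9,2\r\n")
os.close(fd)

# Decode on input: io.open() returns Unicode text, and the utf-8-sig
# codec removes the BOM, so the header starts with a plain "Campaign".
with io.open(path, encoding="utf-8-sig") as f:
    header = f.readline().rstrip(u"\r\n")
    first_row = f.readline().rstrip(u"\r\n")

# Encode on output: hand Unicode text to io.open() and let it encode.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(first_row.split(u",")[0])

# Read the raw bytes back to see what was actually written.
with open(path, "rb") as f:
    raw = f.read()
os.remove(path)
```

Between the two io.open() calls the program only ever handles Unicode text; the bytes exist only at the file boundary.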
Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the doc examples. You could create their analog for csv.DictReader or, as an alternative, just pass utf-8 encoded bytestrings to the csv module.
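One possible shape for such a csv.DictReader analog is sketched below. The names unicode_dict_rows and to_text are hypothetical helpers, not part of the csv module, and the sketch assumes the file is UTF-8, possibly with a BOM. On Python 2 it feeds utf-8 bytestrings to csv and decodes each key and value afterwards; the Python 3 branch is there only so the same sketch runs on either version.

```python
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2


def to_text(value, encoding="utf-8"):
    # Decode bytestrings; leave text (and None for missing cells) untouched.
    return value.decode(encoding) if isinstance(value, bytes) else value


def unicode_dict_rows(path, **kwargs):
    """Yield CSV rows as dicts with Unicode keys and values (hypothetical helper)."""
    if PY2:
        # Python 2: csv expects bytestrings, so read the file in binary mode.
        f = open(path, "rb")
    else:
        # Python 3: csv accepts text; utf-8-sig drops the BOM while decoding.
        f = io.open(path, encoding="utf-8-sig", newline="")
    with f:
        for row in csv.DictReader(f, **kwargs):
            # Decoding keys with utf-8-sig strips the BOM from the first header.
            yield {to_text(k, "utf-8-sig"): to_text(v) for k, v in row.items()}


# Made-up sample file whose first header is preceded by a UTF-8 BOM.
fd, path = tempfile.mkstemp(suffix=".csv")
os.write(fd, b"\xef\xbb\xbfCampaign,Spend\r\nSpring,2\r\n")
os.close(fd)

spends = {}
for row in unicode_dict_rows(path):
    spends[row[u"Campaign"]] = int(row[u"Spend"])
os.remove(path)
```

With the keys decoded this way, the lookup uses the plain u"Campaign" key, and any other non-ASCII header arrives as ordinary Unicode text rather than escaped bytes.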