I am trying to parse a CSV file that contains both English and Hindi characters, and I am reading it as UTF-16. It works fine until it hits the first Hindi character, and then it fails. I am at a loss here.
Here's the code:
import csv
import codecs
csvReader = csv.reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
    print row
The error that I get is:
Traceback (most recent call last):
  File "csvreader.py", line 8, in <module>
    for row in csvReader:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-18: ordinal not in range(128)
How do I solve this?
Edit 1:
I tried the suggested solutions and used the Unicode CSV reader, and now it gives this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
The code is:
import csv
import codecs, io

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = '/home/kuberkaul/Downloads/csv.csv'
reader = unicode_csv_reader(codecs.open(filename))
print reader
for rows in reader:
    print rows
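A note on the second error: byte 0xff at position 0 is consistent with a UTF-16 byte order mark (FF FE in little-endian order). The code above calls codecs.open(filename) without an encoding, so the lines come back as raw bytes, and on Python 2 calling .encode('utf-8') on a byte string first decodes it implicitly as ASCII. The sketch below only illustrates that reading of the error; the sample text is hypothetical and is not taken from the actual file.

```python
import codecs

# Hypothetical sample line mixing ASCII and Devanagari text.
text = u'name,\u0928\u093e\u092e\n'

# The generic 'utf-16' codec prepends a byte order mark when encoding.
raw = text.encode('utf-16')
starts_with_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with the matching codec consumes the BOM and restores the text.
decoded = raw.decode('utf-16')

# Treating the same bytes as ASCII fails on the very first byte.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
```

If that is what is happening, passing 'rb' and 'utf-16' to codecs.open, as in the original snippet, should give the wrapper Unicode lines to encode.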
As the documentation says, in a big Note near the top:
This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
If you follow the link to the example, it shows you the solution: encode each line to UTF-8 before passing it to csv. They even give you a nice wrapper, so you can just replace the csv.reader with unicode_csv_reader and the rest of your code is unchanged:
csvReader = unicode_csv_reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
    print row
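For comparison, here is a minimal sketch for Python 3, which is an assumption on my part since the question's print statement implies Python 2. There the csv module accepts text directly, so decoding the file as UTF-16 on open is enough and no wrapper is needed. The helper name read_utf16_csv and the sample data are hypothetical.

```python
import csv
import io
import os
import tempfile

def read_utf16_csv(path):
    """Return every row of a UTF-16 encoded CSV file as a list of text cells."""
    # newline='' leaves line-ending handling to the csv module.
    with io.open(path, 'r', encoding='utf-16', newline='') as f:
        return list(csv.reader(f))

# Write a small hypothetical sample file so the sketch is self-contained.
sample = u'name,greeting\r\nfoo,\u0928\u092e\u0938\u094d\u0924\u0947\r\n'
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with io.open(path, 'w', encoding='utf-16', newline='') as f:
    f.write(sample)

rows = read_utf16_csv(path)
os.remove(path)
```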
Of course the print isn't going to be very useful, as the str of a list uses the repr of each element, so you're going to get something like [u'foo', u'bar', u'\u0910\u0911'].
You can fix that in the usual ways; e.g., print u', '.join(row) will work if you remember the u, and if Python is able to guess your terminal's encoding (which it can on Mac and modern Linux, but may not be able to on Windows and old Linux, in which case you'll need to map an explicit encode over each column).
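Mapping an explicit encode over each column could look roughly like this. The helper name encode_row is hypothetical, and UTF-8 is only an assumed terminal encoding; substitute whatever your terminal actually uses.

```python
def encode_row(row, encoding='utf-8', sep=b', '):
    """Join the text cells of one row into a single encoded byte string."""
    # Encode each cell explicitly instead of relying on the default codec.
    return sep.join(cell.encode(encoding) for cell in row)

row = [u'foo', u'bar', u'\u0910\u0911']
line = encode_row(row)
```

On Python 2, print encode_row(row) then writes those bytes to the terminal as they are, without going through the ASCII codec.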