python csv unicode utf-16le

utf-16-le BOM csv files


I'm downloading some CSV files from the Play Store (stats, etc.) and want to process them with Python.

cromestant@jumphost-vpc:~/stat_dev/bime$ file -bi stats/installs/*
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le

As you can see, they are UTF-16LE.

I have some code in Python 2.7 that works on some files and not on others:

import codecs
.
.
fp = codecs.open(dir_n + '/' + file_n, 'r', "utf-16")
for line in fp:
    # write to mysql db

This works until:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128)

What is the proper way to do this? I've seen suggestions to re-encode, use the csv module, etc., but the csv module does not handle encoding by itself, so it seems like overkill for just dumping to a database.


Solution

  • Have you tried codecs.EncodedFile?

    import codecs
    import csv

    with open('x.csv', 'rb') as f:
        # Transcode on the fly: UTF-16LE bytes in the file, UTF-8 bytes out to csv
        g = codecs.EncodedFile(f, 'utf8', 'utf-16le', 'ignore')
        c = csv.reader(g)
        for row in c:
            print row
            # and if you want to use unicode instead of str:
            row = [unicode(cell, 'utf8') for cell in row]
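
    If the csv module really is more than you need, here is a rough sketch of the same idea with plain line iteration. It assumes the database layer is happy with UTF-8 byte strings, which the question does not confirm. The helper name iter_utf8_lines is made up for illustration. The UnicodeEncodeError in the question comes from an implicit encode with the ascii codec somewhere after the file has been decoded, so the sketch encodes explicitly. It also uses the 'utf-16' codec rather than 'utf-16le' so that the leading BOM is consumed instead of ending up in the first field.

    ```python
    import io


    def iter_utf8_lines(path):
        """Yield each line of a UTF-16 file as a UTF-8 encoded byte string."""
        # 'utf-16' reads the BOM to pick the byte order and drops it from the data.
        with io.open(path, 'r', encoding='utf-16') as fp:
            for line in fp:
                # Encode explicitly so nothing falls back to the ascii codec.
                yield line.rstrip(u'\r\n').encode('utf-8')
    ```

    Each yielded value is a byte string, so it can be split on commas or passed on as-is. Fields that contain quoted commas would still need the csv approach above.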