Tags: python, python-3.x, csv, unicode, chinese-locale

UnicodeDecodeError when reading a CSV file with Chinese characters


I'm trying to index a Chinese CSV file as documents in Elasticsearch. The data in the CSV starts with the following bytes:

b'Chapter,Content,Score\r\n1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n1.1.2,\xab~\xbd\xe8\xba\xde\xb2z\xa8t\xb2\xce\xacO\xa7_\xb2\xc5\xa6XISO\xbc\xd0\xb7\xc7\xaa\xba\xadn\xa8D\xa1H,1\r\n'

The code is simple:

import codecs
import csv

from elasticsearch import Elasticsearch, helpers


def csv_reader(file_name):
    es = Elasticsearch("https://xxx.us-east-1.es.amazonaws.com/")
    # Decoding the file as UTF-8 is what raises the error shown below.
    with codecs.open(file_name, 'r', 'utf-8') as infile:
        reader = csv.DictReader(infile)
        helpers.bulk(es, reader, index="checklist", doc_type="quality")


if __name__ == "__main__":
    csv_reader('checklist1.csv')

This raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte

Solution

  • The file is not UTF-8 encoded, as the error makes clear. Opening the CSV in an editor suggested the encoding might be Latin-2 (ISO 8859-2), which cannot be right because that encoding has no Chinese characters. Sure enough, decoding with it "works" (no error is raised) but produces gibberish:

    Chapter,Content,Score
    1.1.1,ŹO§_¤wĹçĂŇŤ~˝č¨t˛ÎŠŇťÝŞşŚUśľşŢ¨îŹyľ{ĄH,1
    

    The standard encodings that ship with Python include big5 and big5hkscs, both intended for Traditional Chinese. Both give the same result when printed:

    Chapter,Content,Score
    1.1.1,是否已驗證品質系統所需的各項管制流程?,1
    

    Whether that makes any sense can only be answered by someone who speaks Chinese, but the fact that the conversion succeeded without errors is a bit promising.
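The trial-and-error approach described above can be sketched roughly as follows. This is a minimal illustration, not the asker's final code: the helper names `try_encodings`, `read_rows`, and `read_csv_file` are hypothetical, `SAMPLE` holds only the first data row copied from the question, and the Elasticsearch indexing step is left out so the snippet runs without a cluster.

```python
import csv
import io

# Header and first data row of the file, copied from the question.
SAMPLE = (
    b'Chapter,Content,Score\r\n'
    b'1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce'
    b'\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n'
)


def try_encodings(raw, candidates=("utf-8", "big5", "big5hkscs")):
    """Map each candidate encoding to the decoded text, or None on failure.

    Hypothetical helper: a successful decode only means the bytes are valid
    in that encoding, not that the text is meaningful (compare Latin-2).
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


def read_rows(raw, encoding="big5"):
    """Parse CSV bytes into a list of dicts using the given encoding."""
    text = io.StringIO(raw.decode(encoding), newline="")
    return list(csv.DictReader(text))


def read_csv_file(path, encoding="big5"):
    """Same as read_rows, but reads from a file on disk."""
    # newline="" lets the csv module handle the \r\n line endings itself.
    with open(path, encoding=encoding, newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    for name, decoded in try_encodings(SAMPLE).items():
        print(name, "decodes" if decoded is not None else "fails")
```

In the question's `csv_reader`, the equivalent change would presumably be to open the file with `'big5'` instead of `'utf-8'` before passing it to `csv.DictReader`; whether `big5` or `big5hkscs` is the right choice for the full file still needs checking against the complete data.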