I'm trying to index a Chinese CSV as documents in Elasticsearch. The data in the CSV starts with the following bytes:
b'Chapter,Content,Score\r\n1.1.1,\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H,1\r\n1.1.2,\xab~\xbd\xe8\xba\xde\xb2z\xa8t\xb2\xce\xacO\xa7_\xb2\xc5\xa6XISO\xbc\xd0\xb7\xc7\xaa\xba\xadn\xa8D\xa1H,1\r\n'
And the code is as simple as below:
import csv
import json
import pandas as pd
from elasticsearch import Elasticsearch
es = Elasticsearch("https://xxx.us-east-1.es.amazonaws.com/")
from elasticsearch import helpers
import codecs

def csv_reader(file_name):
    es = Elasticsearch("https://xxx.us-east-1.es.amazonaws.com/")
    with codecs.open(file_name, 'r', 'utf-8') as outfile:
        reader = csv.DictReader(outfile)
        helpers.bulk(es, reader, index="checklist", doc_type="quality")

if __name__ == "__main__":
    with open('checklist1.csv') as f_obj:
        csv_reader('checklist1.csv')
And then I get the error message below:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte
The file is not UTF-8 encoded, which is pretty clear from the error. Opening the CSV with an editor suggested that it might be latin2, which is clearly wrong because that encoding doesn't include Chinese characters. Sure enough, using it "works" (doesn't raise an error) but produces gibberish:
Chapter,Content,Score
1.1.1,ŹO§_¤wĹçĂŇŤ~˝č¨t˛ÎŠŇťÝŞşŚUśľşŢ¨îŹyľ{ĄH,1
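This trial-and-error can be reproduced directly on the raw bytes. A minimal sketch, using the first data cell copied from the byte dump above:

```python
# First data cell of the CSV, taken verbatim from the byte dump above
raw = (b'\xacO\xa7_\xa4w\xc5\xe7\xc3\xd2\xab~\xbd\xe8\xa8t\xb2\xce'
       b'\xa9\xd2\xbb\xdd\xaa\xba\xa6U\xb6\xb5\xba\xde\xa8\xee\xacy\xb5{\xa1H')

# latin2 is a single-byte encoding that assigns a character to nearly
# every byte value, so decoding "succeeds" but yields mojibake
print(raw.decode('latin2'))

# big5 treats the high bytes as lead bytes of two-byte Traditional
# Chinese characters, producing readable text
print(raw.decode('big5'))
```

That a single-byte encoding like latin2 decodes without error proves nothing: almost any byte stream is "valid" latin2. A multi-byte encoding like big5 is a much stronger signal, since random bytes usually fail its lead-byte/trail-byte structure.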
Looking at the standard encodings shipped with Python, there are big5 and big5hkscs, which are for Traditional Chinese. Both give the same result when printed:
Chapter,Content,Score
1.1.1,是否已驗證品質系統所需的各項管制流程?,1
Whether that makes any sense can only be answered by someone who speaks Chinese, but the fact that the conversion succeeded without errors is a bit promising.
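If big5 is indeed the right encoding, the original reader only needs the encoding changed; `codecs.open` is unnecessary on Python 3, where `open` decodes for you. A minimal sketch, keeping the index name and endpoint from the question (the Elasticsearch import is deferred so the CSV-reading part can be tried without the client installed):

```python
import csv

def read_rows(file_name, encoding='big5'):
    # open() handles the decoding on Python 3; newline='' is the
    # csv-module-recommended way to open CSV files
    with open(file_name, 'r', encoding=encoding, newline='') as f:
        yield from csv.DictReader(f)

def index_csv(file_name):
    # Deferred import: only needed when actually indexing
    from elasticsearch import Elasticsearch, helpers
    es = Elasticsearch("https://xxx.us-east-1.es.amazonaws.com/")
    # Each row dict ({'Chapter': ..., 'Content': ..., 'Score': ...})
    # becomes one document in the "checklist" index
    helpers.bulk(es, read_rows(file_name), index="checklist")
```

Note that `helpers.bulk` accepts any iterable of dicts, so the `DictReader` generator can be passed straight through without loading the whole file into memory.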