Python CSV write to file unreadable in Excel (Chinese characters)

I am trying to perform text analysis on Chinese texts. The program is provided below. The output contains unreadable characters such as 浜烘皯鏃ユ姤绀捐. If I change the output file from result.csv to result.txt, the characters appear correctly as 人民日报社论. What is wrong here? I cannot figure it out. I have tried several approaches, including adding a decoder and an encoder.

    # -*- coding: utf-8 -*-
    import os
    import glob
    import jieba
    import jieba.analyse
    import csv
    import codecs  

    segList = []
    raw_data_path = 'monthly_raw_data/'
    file_name = ["201010", "201011", "201012", "201101", "201103", "201105", "201107", "201109", "201110", "201111", "201112", "201201", "201202", "201203", "201205", "201206", "201208", "201210", "201211"]


    for name in file_name:
        all_text = ""
        multi_line_text = ""
        with open(raw_data_path + name + ".txt", "r") as file:
            for line in file:
                if line != '\n':
                    multi_line_text += line
            templist = multi_line_text.split('\n')
            for text in templist:
                all_text += text
            seg_list = jieba.cut(all_text,cut_all=False)
            temp_text = []
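            for item in seg_list:
                temp_text.append(item)  # assumed: loop body is missing in the post

            stop_list = []
            with open("stopwords.txt", "r") as stoplistfile:
                for item in stoplistfile:
                    stop_list.append(item.strip())  # assumed: loop body is missing in the post

            text_without_stopwords = []
            for word in temp_text:
                if word not in stop_list:
                    text_without_stopwords.append(word)  # assumed: body is missing in the post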

    with open("results/result.csv", 'wb') as f:
        writer = csv.writer(f)


  • For UTF-8 encoding, Excel requires a BOM (byte order mark) code point at the start of the file; without it, Excel assumes an ANSI encoding, which is locale-dependent. U+FEFF is the Unicode BOM. Here's a Python 2 example that opens correctly in Excel:

    import csv
    data = [[u'American', u'美国人'],
            [u'Chinese', u'中国人']]
    with open('results.csv','wb') as f:
        f.write(u'\ufeff'.encode('utf8'))  # UTF-8 BOM so Excel detects the encoding
        w = csv.writer(f)
        for row in data:
            w.writerow([item.encode('utf8') for item in row])

    Python 3 makes this easier. Open the file with 'w', newline='' and encoding='utf-8-sig' instead of 'wb'; the file then accepts Unicode strings directly and writes the BOM automatically:

    import csv
    data = [['American', '美国人'],
            ['Chinese', '中国人']]
    with open('results.csv', 'w', newline='', encoding='utf-8-sig') as f:
        w = csv.writer(f)
        w.writerows(data)
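
To check that the 'utf-8-sig' mode really does put a BOM in front of the data, the first bytes of the file can be inspected. This is a minimal Python 3 sketch, not part of the original answer; the temporary path is made up for illustration.

```python
import codecs
import csv
import os
import tempfile

data = [['American', '美国人'],
        ['Chinese', '中国人']]

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'results.csv')

# 'utf-8-sig' emits the BOM before the first character written.
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerows(data)

# The raw file should begin with the three BOM bytes EF BB BF.
with open(path, 'rb') as f:
    has_bom = f.read(3) == codecs.BOM_UTF8

# Reading with 'utf-8-sig' strips the BOM again, so the rows round-trip.
with open(path, newline='', encoding='utf-8-sig') as f:
    rows = list(csv.reader(f))

print(has_bom, rows == data)
```

If either value prints as False, the file was probably written or read with a different encoding than intended.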

    There is also a third-party unicodecsv module that makes this easier in Python 2 as well:

    import unicodecsv
    data = [[u'American', u'美国人'],
            [u'Chinese', u'中国人']]
    with open('results.csv', 'wb') as f:
        w = unicodecsv.writer(f, encoding='utf-8-sig')
        w.writerows(data)
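
As for where the garbled text in the question comes from: it looks like what results when UTF-8 bytes are decoded with a legacy Chinese code page. The Python 3 sketch below reproduces the effect under the assumption that Excel's ANSI fallback on the asker's machine was GBK (code page 936), which the question does not actually state.

```python
import codecs

text = '人民日报社论'
raw = text.encode('utf-8')  # the bytes a BOM-less UTF-8 CSV would contain

# Assumption: the locale's ANSI code page is GBK (cp936).
# errors='replace' keeps the demo running if a byte pair has no GBK mapping.
garbled = raw.decode('gbk', errors='replace')
print(garbled)

# With the BOM prepended, a BOM-aware decoder recovers the original text.
with_bom = codecs.BOM_UTF8 + raw
restored = with_bom.decode('utf-8-sig')
print(restored)
```

The first line printed should resemble the unreadable string from the question, while the second is the original text. The same bytes open fine as a .txt file because most text editors detect UTF-8 without a BOM.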