How to write 'utf-8' to a new CSV file using python3 with Anaconda?


I'm new to Python and pandas. I use Python 3 and run it through the Anaconda platform, in an IDE similar to PyCharm.

I have two arrays that record every word in a long text and its frequency. Each word is stored as a string, and some of them contain non-ASCII (UTF-8) characters:

value = [13, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 

key = ['-', 'Span', 'Found', 'Not', '404.0', '详细', '8.5', 'IIS', 'Details', '错误', 'Machine,', 'K', 'Ltd.', 'Co.,', 'Contact', 'Group', 'Large', 'qinwomachine', 'Trading', 'Qinwo', 'Shanghai', 'Manufacturer', 'Machine', 'Super', 'Abm240', 'Abm120', 'Mic240', 'Mic120', 'Forming', 'Roll', 'wubianstar', 'Electrical', 'Hont', 'China', 'tileformer', '\ufeffContact']

Now I'm trying to write the value and key arrays to a new CSV file called split_word.csv using Python 3 with Anaconda. My code is as follows:

import pandas as pd

# build a dataframe from the two arrays, with the columns named 'word' and 'frequency'
df = pd.DataFrame({"word": key, "frequency": value})

# write the dataframe to a new csv file
df.to_csv("split_word.csv", index=False)

I expect the CSV to contain two columns:

frequency   word
13          -
4           Span
3           Found
3           Not
3           404
3           详细
3           8.5
3           IIS
3           Details
2           错误
2           Machine,
2           K
2           Ltd.
2           Co.,
2           Contact

But the actual result is wrong. '详细' and '错误' are replaced by question marks:

frequency   word
13          -
4           Span
3           Found
3           Not
3           404
3           ????
3           8.5
3           IIS
3           Details
2           ????
2           Machine,
2           K
2           Ltd.
2           Co.,
2           Contact

So the only problem is with the non-ASCII (UTF-8) characters. Should I add decode or encode calls to the code? How can I solve this simple but annoying problem?

Thank you so much!


Solution

  • You just need to specify the encoding:

    df.to_csv("split_word.csv", index=False, encoding="utf-8")
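As a rough way to check that the fix took effect, the sketch below writes a shortened sample of the question's data with an explicit encoding and reads it back. The temporary output path and the `check` and `round_trip_ok` names are hypothetical and only there to keep the example self-contained. The commented-out `utf-8-sig` line rests on an assumption: if the `????` only shows up when the file is opened in a spreadsheet program such as Excel, a byte-order mark may be what the viewer needs.

```python
import os
import tempfile

import pandas as pd

# Shortened sample of the question's data (the first six entries).
key = ['-', 'Span', 'Found', 'Not', '404.0', '详细']
value = [13, 4, 3, 3, 3, 3]

df = pd.DataFrame({"word": key, "frequency": value})

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "split_word.csv")

# Python 3 strings are already Unicode, so no manual encode()/decode() is needed.
# The encoding is applied once, when the file is written.
df.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back with the same encoding to confirm the characters survived.
# dtype=str keeps '404.0' as text; keep_default_na=False avoids turning any
# word into a missing value.
check = pd.read_csv(out_path, encoding="utf-8", dtype=str, keep_default_na=False)
round_trip_ok = check["word"].tolist() == key
print(round_trip_ok)

# Assumption: some spreadsheet viewers only detect UTF-8 when the file starts
# with a byte-order mark. If that is the case, this variant may help:
# df.to_csv(out_path, index=False, encoding="utf-8-sig")
```

If the round trip succeeds but the characters still look wrong in a viewer, the file itself is probably fine and the viewer is decoding it with a different encoding.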