Search code examples
pythonpandasutf-8python-2.xpyexcelerate

Writing special chäracters with pyExcelerate? How to keep utf-8 encoding?


I want to write data from a Pandas dataframe to an Excel file. I am using pyExcelerate rather than the built in to_excel() method to get better performance.

While Pandas handles utf-8 encoded special characters, the encoding seems to be lost when converting to python list. I read through https://docs.python.org/2/howto/unicode.html, however I could not find out how to preserve the encoding. Or is there some way to restore encoding at the time of writing?

# -*- coding: utf-8 -*-
import pandas as pd
from pyexcelerate import Workbook 

df = pd.DataFrame({'val': ['ä']})
print df
#   val
# 0   ä  <- non ascii character a with dots

value = df.values.tolist()
print value
# [['\xc3\xa4']]

wb = Workbook()
ws = wb.new_sheet("sheet name", data=value)
wb.save("output.xlsx")

# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)

Solution

  • Simple, feed your Dataframe with Unicodes by using Unicode literals:

    df = pd.DataFrame({u'val': [u'ä']})
    

    note the u prefixes

    Ensure your file is encoded with UTF-8 (to match the # -*- coding: utf-8 -*- header) (It does looks UTF-8 encoded as 0xc3 is the first byte of ä)

    The error is caused by Python not knowing the encoding of the byte string, when the string is marshalled into a Unicode object later.