Search code examples
pythonexcelpython-2.7export-to-csvxlrd

Output Excel worksheet with special characters to CSV


I have an Excel file with multiple worksheets. I want to take a specific worksheet from the file and save this as a CSV file, delimited by commas ( I haven't gotten this far as yet). The worksheet contains data with special characters similar to last column on the right below. I don't mind ignoring these characters.

    **DateStamp Country ComputerName    Domain    IPAddress   OperatingSystem**
    2017-05-24  USA     Computer1       Domain1   1.2.3.4     Windows 2008
    2017-05-24  England Computer2       Domain2   1.2.3.5     Windows Server® 2008

My code so far is:

import os
import xlrd
import sys

file = 'path/to/my/file.xlsx'
workbook = xlrd.open_workbook(file)
sheet = workbook.sheet_by_name('Data')

for rowx in range(sheet.nrows):
    coldata = sheet.row_values(rowx)
    coldata = " ".join(str(x) for x in coldata).encode('ascii')
with open ('/path/log.txt','a') as results:
    results.write(coldata)
results.close()

I am having to convert the data to string as otherwise it will include the u' next to each field, which i dont want in my CSV. On trying to do that I am encountering various errors. The error I am getting is:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 14: ordinal not in range(128)

If i try to ignore the characters, i land up with this:

colsdata = colsdata.decode('ascii','ignore')
Attribute error: 'list' object has no attribute 'decode'

These errors take place before I am able to write to the log.

Please may I get some assistance. Thanks!


Solution

  • This should do the trick. It will ignore the unicode characters that it can't convert.

    import os
    import xlrd
    import sys
    import unicodedata
    
    workbook_path = "path/to/my/file.xlsx"
    workbook = xlrd.open_workbook(workbook_path)
    sheet = workbook.sheet_by_name("Sheet1")
    csv_data = ""
    
    def normalize(value):
         result = unicodedata.normalize("NFKD", unicode(value)).encode("ascii","ignore")
         return result
    
    for rowx in range(sheet.nrows):
        coldata = sheet.row_values(rowx)
        # Append data to string that we are going to output
        # and add new line
        csv_data += ", ".join(normalize(x) for x in coldata) + "\n"
    
    with open ("/path/log.txt","a") as results:
        results.write(csv_data)
    

    I tested it with an excel file that had some bogus data and the text Windows Server 20008 ® and it returned the following for csv_data:

    print csv_data
    
    foo, bar, baz
    1.0, 2.0, 3.0
    4.0, 5.0, 6.0
    7.0, 8.0, Windows Server 2008