Search code examples
excelpython-2.7csvpandasxlrd

How to convert n rows of xlsx to csv in Python while preserving date values


I am trying to convert an xlsx file to one CSV file containing the header and another CSV file containing the actual data. I have the following requirements:

  1. Header does not start at first row but at row start_line.
  2. Dates should not be considered as floats but in some string format.
  3. I don't know the total number of rows or columns of the file beforehand. I also don't want to specify which column is a date.

Using pandas I get stuck at number 1. I wanted to achieve this in two separate reads where I read from start_line to start_line+1 and from start_line+1 to the end. However it seems like it is not possible to read n lines from an offset. Below is the code I use to just get one file including the header.

import pandas as pd
def parse_excel(file,start_line,sheet,table):
    sh = pd.read_excel(file,sheet,skiprows=start_line)
    sh.to_csv("output.csv",sep='\t',encoding='utf-8',index=False)

Next I have tried this using xlrd but this library treats all dates as floats like in Excel. The only workaround here seems to go through all individual cells which does not seem very efficient or well coded. What I have now:

import xlrd
def parse_excel(file,start_line,sheet,table):
    with xlrd.open_workbook(file) as wb:
        sh = wb.sheet_by_name(sheet)
        header_written = False
        with open('{0}.csv'.format(table),'wb') as csv_file:
            wr = csv.writer(csv_file,delimiter='\t')
            for rownum in range(sh.nrows):
                if not header_written and start_line == rownum:
                    with open('{0}_header.csv'.format(table),'wb') as header:
                        hwr = csv.writer(header,delimiter='\t')
                        hwr.writerow(sh.row_values(rownum))
                        header_written = True
                elif header_written:
                    wr.writerow(sh.row_values(rownum))

Please point me to other solutions/libraries, show a workaround for either one of the above or explain why I should go for the xlrd workaround checking each individual cell.


Solution

  • As long as all of your data is below your header row then following should work. Assuming the header row is at row n (indexing beginning at 0 not 1 like excel).

    df = pd.read_excel('filepath', header=n)
    df.head(0).to_csv('header.csv', index=False)
    df.to_csv('output.csv', header=None, index=False)