I am trying to convert an xlsx file to one CSV file containing the header and another CSV file containing the actual data. I have the following requirements:
start_line
.Using pandas
I get stuck at number 1.
I wanted to achieve this in two separate reads where I read from start_line to start_line+1
and from start_line+1
to the end.
However it seems like it is not possible to read n lines from an offset. Below is the code I use to just get one file including the header.
import pandas as pd
def parse_excel(file,start_line,sheet,table):
sh = pd.read_excel(file,sheet,skiprows=start_line)
sh.to_csv("output.csv",sep='\t',encoding='utf-8',index=False)
Next I have tried this using xlrd
but this library treats all dates as floats like in Excel. The only workaround here seems to go through all individual cells which does not seem very efficient or well coded. What I have now:
import xlrd
def parse_excel(file,start_line,sheet,table):
with xlrd.open_workbook(file) as wb:
sh = wb.sheet_by_name(sheet)
header_written = False
with open('{0}.csv'.format(table),'wb') as csv_file:
wr = csv.writer(csv_file,delimiter='\t')
for rownum in range(sh.nrows):
if not header_written and start_line == rownum:
with open('{0}_header.csv'.format(table),'wb') as header:
hwr = csv.writer(header,delimiter='\t')
hwr.writerow(sh.row_values(rownum))
header_written = True
elif header_written:
wr.writerow(sh.row_values(rownum))
Please point me to other solutions/libraries, show a workaround for either one of the above or explain why I should go for the xlrd
workaround checking each individual cell.
As long as all of your data is below your header row then following should work. Assuming the header row is at row n
(indexing beginning at 0 not 1 like excel).
df = pd.read_excel('filepath', header=n)
df.head(0).to_csv('header.csv', index=False)
df.to_csv('output.csv', header=None, index=False)