Search code examples
excelpython-3.xpandasxlrd

Can pandas implicitly determine header based on value, not row?


I work with people who use Excel and continuously add or subtract rows unbeknownst to me. I have to scrape a document for data, and the row where the header is found changes based on moods.

My challenge is to handle these oscillating currents by detecting where the header is.

I first organized my scrape using xlrd and a number of conditional statements using the values in the workbook.

My initial attempt works and is long (so I will not publish it) but involves bringing in the entire sheet, and not slices:

from xlrd import open_workbook

book = open_workbook(fName)
sheet = book.sheet_by_name(sht)

return book,sheet

However, it is big and I would prefer to get a more targeted selection. The header values never change, nor does when the data shows up after this row.

Do you know of a way to implicitly get the header based on a found value in the sheet using either pandas.ExcelFile or pandas.read_excel?

Here is my attempt with pandas.ExcelFile:

import pandas as pd

xlsx = pd.ExcelFile(fName)
dataFrame = pd.read_excel(xlsx, sht,
                          parse_cols=21, merge_cells=noMerge, 
                          header=header)

return dataFrame

I cannot get the code to work unless I give the call the correct header value, which is exactly what I'm hoping to avoid.

This previous question seems to present a similar problem without addressing the concern of finding the headers implicitly.


Solution

  • Do the same loop through ExcelFile objects:

    xlsx = pd.ExcelFile(fName)
    sheet = xlsx.sheet_by_name(sht)
    # apply the same algorithm you wrote against xlrd here
    # ... results in having header_row = something, 0 based
    dataFrame = pd.read_excel(xlsx, sht,
                          parse_cols=21, merge_cells=noMerge, 
                          skip_rows=header_row)