Parsing for Column Headings in Excel worksheet with Python

I have an Excel spreadsheet which looks like this:

Row1 :           some text
some other row : "ABC Col" "DEF Col" "GHI Col" "JKL Col"
following rows : more text

I am trying to find the row that contains ABC and JKL. Note the string that I am passing may not match the exact column heading.

I am using xlrd and am doing something like this:

setOfheadings = set(['ABC', 'JKL'])
found_header = False
for i in range(1,sheet.nrows):
    if ((not found_header)):
        setOfRowValues = set([element.upper() for element in sheet.row_values(i)])
        if len(setOfheadings.intersection(setOfRowValues)) == len(setOfheadings):
            (found_header, header_row) = (True,i)

Since 'ABC' does not exactly match 'ABC Col' in the spreadsheet. It fails. It works if my setOfheadings contains the exact match.

Any ideas on how I can do a regex match with set intersection?

Solution

Here's one way of doing it. Features: Factors out the definition of inexact match from the code that iterates over columns and rows. Avoids a crash on non-text data. Bales out when it's found the presumably-desired row.

targets = ('ABC', 'JKL')

def fuzzy_match(target, some_text):
    return target in some_text # or something fancier

found_header = False    
for i in xrange(1, sheet.nrows):
    row_text = [
        v.upper()
        for v, t in zip(sheet.row_values(i), sheet.row_types(i))
        if t == xlrd.XL_CELL_TEXT # avoid non-text cells; see note below
        ]
    found_header = all(
        any(fuzzy_match(target, item) for item in row_text)
        for target in targets
        )
    if found_header:
        header_row = i
        break

The code that avoids non-text cells is xlrd-specific; a more general way of doing it would be:

for v in sheet.row_values(i)
if isinstance(v, basestring)