Search code examples
pythonfuzzy-comparison

Parsing for Column Headings in Excel worksheet with Python


I have an Excel spreadsheet which looks like this:

Row1 :           some text
some other row : "ABC Col" "DEF Col" "GHI Col" "JKL Col"
following rows : more text

I am trying to find the row that contains ABC and JKL. Note the string that I am passing may not match the exact column heading.

I am using xlrd and am doing something like this:

setOfheadings = set(['ABC', 'JKL'])
found_header = False
for i in range(1,sheet.nrows):
    if ((not found_header)):
        setOfRowValues = set([element.upper() for element in sheet.row_values(i)])
        if len(setOfheadings.intersection(setOfRowValues)) == len(setOfheadings):
            (found_header, header_row) = (True,i)

Since 'ABC' does not exactly match 'ABC Col' in the spreadsheet. It fails. It works if my setOfheadings contains the exact match.

Any ideas on how I can do a regex match with set intersection?


Solution

  • Here's one way of doing it. Features: Factors out the definition of inexact match from the code that iterates over columns and rows. Avoids a crash on non-text data. Bales out when it's found the presumably-desired row.

    targets = ('ABC', 'JKL')
    
    def fuzzy_match(target, some_text):
        return target in some_text # or something fancier
    
    found_header = False    
    for i in xrange(1, sheet.nrows):
        row_text = [
            v.upper()
            for v, t in zip(sheet.row_values(i), sheet.row_types(i))
            if t == xlrd.XL_CELL_TEXT # avoid non-text cells; see note below
            ]
        found_header = all(
            any(fuzzy_match(target, item) for item in row_text)
            for target in targets
            )
        if found_header:
            header_row = i
            break
    

    The code that avoids non-text cells is xlrd-specific; a more general way of doing it would be:

    for v in sheet.row_values(i)
    if isinstance(v, basestring)