I have an Excel spreadsheet which looks like this:
Row1 : some text
some other row : "ABC Col" "DEF Col" "GHI Col" "JKL Col"
following rows : more text
I am trying to find the row that contains ABC and JKL. Note the string that I am passing may not match the exact column heading.
I am using xlrd
and am doing something like this:
setOfheadings = set(['ABC', 'JKL'])
found_header = False
for i in range(1,sheet.nrows):
if ((not found_header)):
setOfRowValues = set([element.upper() for element in sheet.row_values(i)])
if len(setOfheadings.intersection(setOfRowValues)) == len(setOfheadings):
(found_header, header_row) = (True,i)
Since 'ABC' does not exactly match 'ABC Col' in the spreadsheet. It fails. It works if my setOfheadings
contains the exact match.
Any ideas on how I can do a regex match with set intersection?
Here's one way of doing it. Features: Factors out the definition of inexact match from the code that iterates over columns and rows. Avoids a crash on non-text data. Bales out when it's found the presumably-desired row.
targets = ('ABC', 'JKL')
def fuzzy_match(target, some_text):
return target in some_text # or something fancier
found_header = False
for i in xrange(1, sheet.nrows):
row_text = [
v.upper()
for v, t in zip(sheet.row_values(i), sheet.row_types(i))
if t == xlrd.XL_CELL_TEXT # avoid non-text cells; see note below
]
found_header = all(
any(fuzzy_match(target, item) for item in row_text)
for target in targets
)
if found_header:
header_row = i
break
The code that avoids non-text cells is xlrd-specific; a more general way of doing it would be:
for v in sheet.row_values(i)
if isinstance(v, basestring)