I am processing some tables using lxml. The original source files are Excel files saved in MHTML format. I need to find the rows that contain header ('th') elements. I want to use the header elements, but I also need the rows they came from so that I process everything in order.
So what I have been doing is finding all of the th elements and then calling e.getparent() on each one to get its row (since a th is a child of a row). But I end up pulling the th elements twice: once to find them and get the rows, and then again from the rows to parse the data I am looking for. This can't be the best way to do this, so I am wondering if there is something I am missing.
Here is my code:
    from lxml import html
    theString=unicode(open('c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls').read(),'UTF-8','replace')
    theTree=html.fromstring(theString)
    tables=[e for e in theTree.iter() if e.tag=='table']
    for table in tables :
        headerCells=[e for e in table.iter() if e.tag=='th']
        headerRows=[]
        for headerCell in headerCells:
            if headerCell.getparent().tag=='tr':
                if headerCell.getparent() not in headerRows:
                    headerRows.append(headerCell.getparent())
        for headerRow in headerRows:
            newHeaderCells=[e for e in headerRow.iter() if e.tag=='th']
            #Now I will extract some data and attributes from the th elements
Iterate over all tr tags, and just move on to the next one when you find no th inside.
EDIT. This is how:
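    from lxml import html
    theString=unicode(open('c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls').read(),'UTF-8','replace')
    theTree=html.fromstring(theString)
    for table in theTree.iter('table'):
        # findall('tr') matches direct children of the table only;
        # use table.iter('tr') if the rows sit inside thead/tbody
        for row in table.findall('tr'):
            # findall() already returns a list of the row's th children
            headerCells = row.findall('th')
            if headerCells:
                #extract data from row and headerCells
                pass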
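If you would rather let XPath do the filtering, the same idea can be sketched as below. This is only an illustration under some assumptions: the SAMPLE markup and the header_rows helper are made up for the example (they are not from your file), and it is written for Python 3, whereas the code above uses Python 2's unicode(). The predicate in './/tr[th]' keeps only rows that have at least one th child, so each row and its header cells come back together, in document order.

```python
from lxml import html

# Hypothetical sample markup standing in for the Excel-exported file.
SAMPLE = """
<table>
  <tr><th>Year</th><th colspan="2">Revenue</th></tr>
  <tr><td>2010</td><td>10</td><td>12</td></tr>
  <tr><th>Total</th><td>10</td><td>12</td></tr>
</table>
"""


def header_rows(table):
    """Yield (row, header_cells) for every row with at least one th child.

    Hypothetical helper. './/tr[th]' also reaches rows wrapped in
    thead/tbody, but it descends into nested tables as well, so rows of
    a nested table would be visited once per enclosing table.
    """
    for row in table.xpath('.//tr[th]'):
        yield row, row.findall('th')


tree = html.fromstring(SAMPLE)

result = []
for table in tree.iter('table'):
    for row, cells in header_rows(table):
        # Collect the text and one attribute from each header cell.
        result.append([(c.text_content().strip(), c.get('colspan')) for c in cells])

print(result)
```

Each th is looked up only once here, and the row it belongs to is available at the same moment, so there is no need for getparent() or a second pass.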