I need to parse large HTML tables, so I use lxml with XPath queries. Sometimes table cells contain nested tags (e.g. SPAN):
<html>
<table>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td><span>3</span></td>
<td>4</td>
</tr>
</table>
</html>
and I don't know how to handle this properly. My Python code
from lxml import etree
from io import StringIO
html_text = '<html><table><tr><td>1</td><td>2</td></tr><tr><td><span>3</span></td><td>4</td></tr></table></html>'
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html_text), parser)
rows = tree.xpath('//tr')
for row in rows:
    row_values = []
    for cell in row:
        row_values.append(cell.text)
    print(row_values)
generates
['1', '2']
[None, '4']
Could you suggest how to handle this kind of issue (nested tags) properly? I presume I have to either get the last child of each TD or configure the parser somehow.
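Use cell.xpath('string()') instead of cell.text to read the string value of each cell. cell.text only holds the text that comes before the cell's first child element, so it is None for <td><span>3</span></td>, whereas string() concatenates all text inside the cell, including text in nested tags.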
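For illustration, here is a minimal sketch of the question's loop with that change applied. The helper name table_rows and the './td | ./th' selection are my own assumptions for the example, not something from the question; adjust them to your real tables.

```python
from io import StringIO
from lxml import etree

html_text = (
    '<html><table>'
    '<tr><td>1</td><td>2</td></tr>'
    '<tr><td><span>3</span></td><td>4</td></tr>'
    '</table></html>'
)

def table_rows(html):
    """Return one list of cell strings per <tr> (hypothetical helper)."""
    tree = etree.parse(StringIO(html), etree.HTMLParser())
    rows = []
    for row in tree.xpath('//tr'):
        # string() returns all text inside the cell, including nested tags
        rows.append([cell.xpath('string()') for cell in row.xpath('./td | ./th')])
    return rows

for values in table_rows(html_text):
    print(values)
```

With the sample table this should print ['1', '2'] and ['3', '4']. If you prefer to stay in the element API, ''.join(cell.itertext()) should give the same text for each cell.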