Tags: python, xpath, html-parsing, lxml

Parsing HTML table (lxml, XPath) with enclosed tags


I need to parse large HTML tables, so I use lxml with XPath queries. Some table cells contain nested tags (e.g. SPAN):

<html>
  <table>
    <tr>
      <td>1</td>
      <td>2</td>
    </tr>
    <tr>
      <td><span>3</span></td>
      <td>4</td>
    </tr>
  </table>
</html>

and I don't know how to handle this properly. My Python code

from lxml import etree
from io import StringIO

html_text = '<html><table><tr><td>1</td><td>2</td></tr><tr><td><span>3</span></td><td>4</td></tr></table></html>'

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html_text), parser)
rows = tree.xpath('//tr')

for row in rows:
    row_values = []
    for cell in row:
        row_values.append(cell.text)
    print(row_values)

generates

['1', '2']
[None, '4']

How should I handle cells with nested tags properly? My guess is that I have to get the last child of each TD or configure the parser somehow.


Solution

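  • Use cell.xpath('string()') instead of cell.text to read the string value of each cell. cell.text holds only the text that comes before the first child element, so it is None for <td><span>3</span></td>, whereas string() concatenates the text of the cell and all of its descendants.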
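
Applied to the code from the question, this could look roughly like the sketch below. The helper name table_rows is made up for illustration, and selecting cells with 'td | th' is an assumption about what should count as a cell; the rest reuses the question's own parser setup.

```python
from io import StringIO

from lxml import etree

html_text = (
    '<html><table>'
    '<tr><td>1</td><td>2</td></tr>'
    '<tr><td><span>3</span></td><td>4</td></tr>'
    '</table></html>'
)


def table_rows(html):
    """Return the table rows as lists of cell strings (hypothetical helper)."""
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(html), parser)
    rows = []
    for row in tree.xpath('//tr'):
        # string() yields the concatenated text of the cell and all of its
        # descendants, so nested tags such as <span> no longer produce None.
        rows.append([cell.xpath('string()') for cell in row.xpath('td | th')])
    return rows


for row_values in table_rows(html_text):
    print(row_values)
# ['1', '2']
# ['3', '4']
```

If the cells may contain surrounding or repeated whitespace, cell.xpath('normalize-space()') trims and collapses it in the same call. ''.join(cell.itertext()) is an equivalent that avoids XPath.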