Tags: python, html-parsing, lxml

HTML getnext using lxml parser


I am trying to parse a page with html code as below:

<html>
..
<h2><span id='identifiedid'>Identified Header<span>...</span></span></h2>
<ul>
  <li><a href='links i want'></a>...</li>
  <li><a href='links i want'></a>...</li>
  <li><a href='links i want'></a>...</li>
</ul>
..
</html>

I am parsing the page in Python with the lxml parser. I can locate the element with the id shown above using XPath. However, the links I need have no class or id of their own, and they are not inside that span. Is there a way to reach the links in the adjacent element? I tried getnext(), but it does not reach the ul and li elements.


Solution

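  • getnext() returns the next sibling of the element it is called on. In the markup shown, the span has no following sibling inside the h2, so calling getnext() on the span returns None. The ul is a sibling of the h2, so get the parent of the span using getparent() and then get the ul element using getnext():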

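    from lxml import etree

    # The <ul> is a sibling of the <h2> that wraps the span, not of the span itself.
    root = etree.parse("lx.xml").getroot()
    span = root.xpath("//span[@id='identifiedid']")[0]
    print(span.getparent().getnext().xpath('li/a/@href'))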
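
    For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch. It assumes the page looks like the sample in the question; the markup is inlined as a string, the href values (/link-1 and so on) are placeholders made up for illustration, and lxml.html is used instead of etree.XML because the input is HTML. It also shows an alternative that is not part of the answer above: a single XPath expression using the following-sibling axis.

```python
from lxml import html

# Sample markup modeled on the question; the href values are placeholders.
PAGE = """
<html><body>
<h2><span id="identifiedid">Identified Header<span>...</span></span></h2>
<ul>
  <li><a href="/link-1">one</a></li>
  <li><a href="/link-2">two</a></li>
  <li><a href="/link-3">three</a></li>
</ul>
</body></html>
"""

root = html.fromstring(PAGE)
span = root.xpath("//span[@id='identifiedid']")[0]

# getnext() on the span itself finds nothing here: the span is the only
# child of the <h2>, so it has no following sibling.
span_next = span.getnext()

# Approach 1 (from the answer): step up to the <h2>, then across to the
# next sibling element, which is the <ul>.
ul = span.getparent().getnext()
links_via_getnext = ul.xpath("li/a/@href")

# Approach 2 (alternative): select the first <ul> that follows the <h2>
# containing the identified span, all in one XPath expression.
links_via_xpath = root.xpath(
    "//h2[span/@id='identifiedid']/following-sibling::ul[1]/li/a/@href"
)

print(span_next)
print(links_via_getnext)
print(links_via_xpath)
```

    The XPath version does not depend on the ul being the very next element after the h2, so it may be the more forgiving choice if other elements can appear between the header and the list.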