I am trying to parse a page with HTML like the following:
<html>
..
<h2><span id='identifiedid'>Identified Header<span>...</span></span></h2>
<ul>
<li><a href='links i want'></a>...</li>
<li><a href='links i want'></a>...</li>
<li><a href='links i want'></a>...</li>
</ul>
..
</html>
I am parsing the page in Python with the lxml parser. I can locate the element with the given id using XPath. However, the links I need have no class or id to identify them, and they are not inside the span with that id. Is there a way to access the links in the adjacent element? I tried getnext(), but it does not reach the ul and li elements.
You can get the parent of the span using getparent() and then get the ul element using getnext():
from lxml import etree

# The span's parent is the <h2>; the <h2>'s next sibling is the <ul>.
root = etree.parse("lx.xml").getroot()
span = root.xpath("//span[@id='identifiedid']")[0]
print(span.getparent().getnext().xpath('li/a/@href'))
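
For reference, here is a small self-contained sketch of why calling getnext() on the span itself probably returned nothing, along with a single-XPath alternative using the following-sibling axis. The inline markup and the href values ('/first', '/second') are made up for illustration and only approximate the page in the question; if the real page is not well-formed XML, lxml.html may be a better fit than etree.

```python
from lxml import etree

# Hypothetical sample markup modelled on the question's page.
SAMPLE = """<html>
<h2><span id='identifiedid'>Identified Header<span>x</span></span></h2>
<ul>
<li><a href='/first'>one</a></li>
<li><a href='/second'>two</a></li>
</ul>
</html>"""

root = etree.fromstring(SAMPLE)
span = root.xpath("//span[@id='identifiedid']")[0]

# The span is the only child of <h2>, so it has no next sibling.
no_sibling = span.getnext()

# Sibling navigation: climb to the <h2>, then step to the <ul> after it.
via_siblings = span.getparent().getnext().xpath("li/a/@href")

# Single XPath: first <ul> following the <h2> that contains the span.
via_axis = root.xpath(
    "//h2[span/@id='identifiedid']/following-sibling::ul[1]/li/a/@href"
)

print(no_sibling, via_siblings, via_axis)
```

The XPath version does not depend on the ul being the immediate next sibling of the h2, so it may hold up better if other elements sit between them.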