I want to extract the content of the node following an a tag with XPath in Python. So far I can extract the content when it has no nested tags. The problem is that my method does not work if the following node has a child node. I'm using the lxml package, and here is my code:
from lxml.html import etree, fromstring
# root is assumed to be the already parsed page, e.g. root = fromstring(html_source)
reference_titles = root.xpath("//table[@id='vulnrefstable']/tr/td")
for tree in reference_titles:
    a_tag = tree.xpath('a/@href')[0]
    title = tree.xpath('a/following-sibling::text()')
This works for the following HTML:
<tr>
<td class="r_average">
<a href="http://somelink.com" target="_blank" title="External url">
http://somelink.com
</a>
<br/> SECUNIA 27633
</td>
</tr>
Here the title is correctly "SECUNIA 27633", but with this HTML:
<tr>
<td class="r_average">
<a href="http://somelink.com" target="_blank" title="External url">
http://somelink.com
</a>
<br/> SECUNIA 27633 <i>Release Date:</i> tomorrow
</td>
</tr>
The result is "SECUNIA 27633 tomorrow".
How can I extract "SECUNIA 27633 Release Date: tomorrow"?
EDIT: Using node() instead of text() in the XPath returns all the nodes, so I use this and build the final string with a nested for statement:
title = tree.xpath('a/following-sibling::node()')
But I want to know whether there is a better way to simply extract the text content, regardless of child nodes, with an XPath query.
Try this one:
for tree in reference_titles:
    a_tag = tree.xpath('a/@href')[0]
    title = " ".join([node.strip() for node in tree.xpath('.//text()[not(parent::a)]') if node.strip()])
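
For anyone who wants to try it, here is a minimal self-contained sketch of the same idea run against the second HTML sample from the question. The HTML constant and the extract_references helper are names made up for this illustration, not part of the original code, and the sketch uses //td rather than /tr/td only so that it does not depend on how the parser nests the table rows.

```python
from lxml.html import fromstring

# Sample markup from the question, wrapped in the table the XPath expects.
HTML = """
<table id="vulnrefstable">
  <tr>
    <td class="r_average">
      <a href="http://somelink.com" target="_blank" title="External url">
        http://somelink.com
      </a>
      <br/> SECUNIA 27633 <i>Release Date:</i> tomorrow
    </td>
  </tr>
</table>
"""


def extract_references(root):
    """Return (href, title) pairs for each cell that contains a link.

    Hypothetical helper for this example: the title is every text node
    in the cell whose direct parent is not the <a> element.
    """
    results = []
    for cell in root.xpath("//table[@id='vulnrefstable']//td"):
        hrefs = cell.xpath("a/@href")
        if not hrefs:
            continue  # skip cells without a link
        # Text directly inside <a> is excluded; text inside <i> and the
        # loose text after <br/> are both kept.
        texts = cell.xpath(".//text()[not(parent::a)]")
        title = " ".join(t.strip() for t in texts if t.strip())
        results.append((hrefs[0], title))
    return results


if __name__ == "__main__":
    for href, title in extract_references(fromstring(HTML)):
        print(href, "|", title)
```

Note that the predicate only filters out text whose immediate parent is the a element, so text nested deeper inside the link (for example in a span within it) would still be picked up and may need an ancestor::a check instead.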