python python-3.x xpath html-parsing lxml

Select the entire text from the following node with child nodes using an XPath query in Python


I want to extract the content that follows an a tag using XPath in Python. So far I can extract that content when it contains no nested tags. The problem is that my method does not work if the following content has a child node in it. I'm using the lxml package, and here is my code:

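from lxml.html import fromstring

# Assumption: `html` holds the page source as a string.
root = fromstring(html)

reference_titles = root.xpath("//table[@id='vulnrefstable']/tr/td")
for tree in reference_titles:
    a_tag = tree.xpath('a/@href')[0]
    # Selects only the text nodes that are direct siblings of the a tag
    title = tree.xpath('a/following-sibling::text()')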

This works for the following HTML:

<tr>

    <td class="r_average">

        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633                     
    </td>

</tr>

Here the title is correctly "SECUNIA 27633", but for this HTML:

<tr>

    <td class="r_average">

        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow               
    </td>

</tr>

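The result is "SECUNIA 27633 tomorrow": the text inside the i tag is skipped, because following-sibling::text() selects only sibling text nodes and not the text nested inside sibling elements.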

How can I extract "SECUNIA 27633 Release Date: tomorrow"?


EDIT: Using node() instead of text() in the XPath returns all the following nodes, elements included, so I use this and build the final string with a nested for loop:

title = tree.xpath('a/following-sibling::node()')

But I would like to know whether there is a better way to simply extract the text content, regardless of child nodes, with an XPath query.
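
For reference, the node() workaround described in the edit might look roughly like the sketch below. The names SAMPLE and following_text are made up for illustration, and the markup is the second snippet from above wrapped in the table that the XPath expects. It relies on lxml returning text nodes as str subclasses and elements as objects with a text_content() method.

```python
from lxml.html import fromstring

# Hypothetical sample: the second snippet from the question, wrapped in its table.
SAMPLE = """
<html><body>
<table id="vulnrefstable">
<tr>
    <td class="r_average">
        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow
    </td>
</tr>
</table>
</body></html>
"""


def following_text(td):
    """Concatenate the text of every node that follows the a tag inside td."""
    parts = []
    for node in td.xpath('a/following-sibling::node()'):
        if isinstance(node, str):
            # Text nodes come back from lxml as str subclasses.
            parts.append(node)
        else:
            # Elements such as <i> (or the empty <br>) contribute their inner text.
            parts.append(node.text_content())
    # Collapse runs of whitespace, including newlines, into single spaces.
    return " ".join(" ".join(parts).split())


td = fromstring(SAMPLE).xpath("//table[@id='vulnrefstable']/tr/td")[0]
print(following_text(td))
```

This keeps the text of the i tag, but it needs a type check per node, which is the part I would like to avoid.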


Solution

  • Try this one:

    for tree in reference_titles:
        a_tag = tree.xpath('a/@href')[0]
        title = " ".join([node.strip() for node in tree.xpath('.//text()[not(parent::a)]') if node.strip()])
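
    The expression .//text()[not(parent::a)] selects every text node anywhere under the td, including text nested in child elements such as i, and drops only the text nodes whose direct parent is the a tag. A self-contained sketch of this approach, run against both snippets from the question, could look like the following. PAGE and extract_references are illustrative names that do not appear in the original code, and the two rows are assumed to sit in the same table.

```python
from lxml.html import fromstring

# Hypothetical page: both rows from the question placed in one table.
PAGE = """
<html><body>
<table id="vulnrefstable">
<tr>
    <td class="r_average">
        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633
    </td>
</tr>
<tr>
    <td class="r_average">
        <a href="http://somelink.com" target="_blank" title="External url">
            http://somelink.com
        </a>
        <br/> SECUNIA 27633     <i>Release Date:</i> tomorrow
    </td>
</tr>
</table>
</body></html>
"""


def extract_references(page):
    """Return a list of (href, title) pairs, one per table cell."""
    root = fromstring(page)
    results = []
    for td in root.xpath("//table[@id='vulnrefstable']/tr/td"):
        href = td.xpath('a/@href')[0]
        # All text nodes under the cell except those directly inside the a tag.
        pieces = (text.strip() for text in td.xpath('.//text()[not(parent::a)]'))
        # Skip whitespace-only nodes and join the rest with single spaces.
        title = " ".join(piece for piece in pieces if piece)
        results.append((href, title))
    return results


for href, title in extract_references(PAGE):
    print(href, "->", title)
```

    One caveat: not(parent::a) only excludes text whose direct parent is the a tag. If the link itself could contain nested markup (for example a b tag inside the a), not(ancestor::a) would be the safer predicate.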