python-3.x html-parsing python-requests-html

Extract text only from the parent tag with Requests-HTML


I'd like to extract only the text that belongs directly to the parent tag using Requests-HTML. Given HTML like this:

<td>
    <a href="">There</a> <a href="">are</a> <a href="">some</a> <a href="">links.</a> The text that we are looking for.
</td>

then

html.find('td', first=True).text

returns the text of the links as well:

There are some links. The text that we are looking for.

How can I get only "The text that we are looking for."?


Solution

  • You can use an XPath expression, which the library supports directly:

    from requests_html import HTML

    doc = """<td>
        <a href="">There</a> <a href="">are</a> <a href="">some</a> <a href="">links.</a> The text that we are looking for.
    </td>"""
    html = HTML(html=doc)
    # text() returns every direct text node of <td>, including the
    # whitespace between the <a> tags
    text_list = html.xpath('//td/text()')
    # join the pieces and strip the surrounding whitespace
    print(''.join(text_list).strip())  # The text that we are looking for.
    

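    The expression //td/text() selects only the text nodes that are direct children of each td element, so the text inside the <a> tags is skipped. By contrast, //td//text() would select every text node under td, including the link text.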
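
To make the difference between the two expressions concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Requests-HTML (an assumption made only so the example runs without extra packages; ElementTree does not support text() in its XPath subset). The names own_text and all_text are illustrative. The element's own text plus the "tail" of each child stands in for //td/text(), and itertext() stands in for //td//text().

```python
import xml.etree.ElementTree as ET

doc = """<td>
    <a href="">There</a> <a href="">are</a> <a href="">some</a> <a href="">links.</a> The text that we are looking for.
</td>"""

td = ET.fromstring(doc)

# Direct text nodes of <td>: its leading text plus the "tail" that follows
# each child element. This is the analogue of //td/text().
direct_pieces = [td.text or ""] + [child.tail or "" for child in td]
own_text = "".join(direct_pieces).strip()

# Every text node under <td>, including the text inside the <a> tags.
# This is the analogue of //td//text().
all_text = "".join(td.itertext()).strip()

print(own_text)
print(all_text)
```

The first line printed is only the parent's own text; the second also contains the link text, which is the same result that .text gave in the question.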