How to use lxml to find an element by text?


Assume we have the following HTML:

<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>

How do I find the "a" element that contains "TEXT A"?

So far I've got:

root = lxml.html.document_fromstring(the_html_above)
e = root.find('.//a')

I've tried:

e = root.find('.//a[@text="TEXT A"]')

but that didn't work, as the "a" tags have no attribute "text".

Is there any way I can solve this in a similar fashion to what I've tried?


Solution

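  • You are very close. Use text()= rather than @text (in XPath, @ selects an attribute, while text() selects the element's text nodes), and call xpath() instead of find(), since find() supports only a limited subset of XPath that does not include text().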

    e = root.xpath('.//a[text()="TEXT A"]')
    

    Or, if you know only that the text contains "TEXT A",

    e = root.xpath('.//a[contains(text(),"TEXT A")]')
    

    Or, if you know only that text starts with "TEXT A",

    e = root.xpath('.//a[starts-with(text(),"TEXT A")]')
    

    See the XPath documentation for more on the available string functions.


    For example,

    import lxml.html as LH
    
    text = '''\
    <html>
        <body>
            <a href="/1234.html">TEXT A</a>
            <a href="/3243.html">TEXT B</a>
            <a href="/7445.html">TEXT C</a>
        </body>
    </html>'''
    
    root = LH.fromstring(text)
    e = root.xpath('.//a[text()="TEXT A"]')
    print(e)
    

    yields a list containing the matching element (the memory address will vary):

    [<Element a at 0xb746d2cc>]
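
Two practical details follow from the answer above: xpath() returns a list rather than a single element, and text()="..." is an exact comparison, so stray whitespace around the link text prevents a match. The following is a minimal sketch of one way to handle both, not the only approach. The helper name find_links_by_text is hypothetical, and the sample HTML has been altered to put extra whitespace around "TEXT B" for illustration. It relies on the standard XPath function normalize-space() and on lxml's support for XPath variables passed as keyword arguments.

```python
import lxml.html as LH

# Sample document; the whitespace around "TEXT B" is added for illustration.
html = '''\
<html>
    <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">  TEXT B
        </a>
        <a href="/7445.html">TEXT C</a>
    </body>
</html>'''

root = LH.fromstring(html)


def find_links_by_text(root, text):
    """Hypothetical helper: return <a> elements whose normalized text equals `text`."""
    # normalize-space(.) takes the element's string value, strips leading and
    # trailing whitespace, and collapses inner runs of whitespace.
    # $text is an XPath variable bound from the keyword argument, which
    # avoids quoting problems when the search text contains quotes.
    return root.xpath('.//a[normalize-space(.)=$text]', text=text)


matches = find_links_by_text(root, "TEXT B")
# xpath() always returns a list, so pick the first match explicitly.
first = matches[0] if matches else None
if first is not None:
    print(first.get("href"))
```

With the exact form .//a[text()="TEXT B"], the padded link in this sample would not be matched, which is why the sketch compares the normalized string instead.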