python regex html-parsing lxml

Applying root.xpath() with regex returns an lxml.etree._ElementUnicodeResult


I'm building a model that finds where a piece of text is located in an HTML file.

I have a database with plenty of data from different newspapers' articles: title, publish date, authors and news text. By analyzing this data, I want to generate a model that can find by itself the XPath to the HTML tags that hold this content.

The problem is when I use a regex within the xpath method as shown here:

from lxml import html

with open('somecode.html', 'r') as f:
    root = html.fromstring(f.read())

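# The "re" prefix must be bound to the EXSLT regular-expressions namespace.
list_of_xpaths = root.xpath(
    '//*/@*[re:match(.,"2019-04-15")]',
    namespaces={'re': 'http://exslt.org/regular-expressions'},
)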

This example searches for the publish date in the markup. It returns a list of lxml.etree._ElementUnicodeResult objects instead of lxml.etree._Element objects.

Unfortunately, this type of result doesn't let me get the XPath to where it is located the way an lxml.etree._Element does: calling root.getroottree().getpath(list_of_xpaths[0]) fails.

Is there a way to get the XPath for this type of element? How?

Is there a way to make lxml with a regex return an lxml.etree._Element instead?


Solution

  • The problem is that the query selects attribute values, and lxml represents each one as an instance of the _ElementUnicodeResult class (a "smart string").

    If we introspect what the _ElementUnicodeResult class provides, we can see that it lets you get to the element that carries the attribute via the .getparent() method:

    attribute = list_of_xpaths[0]
    element = attribute.getparent()
    
    print(root.getroottree().getpath(element))
    

    This gives us the path to the element, but since we need the attribute name as well, we can read it from .attrname:

    print(attribute.attrname)
    

    Then, to get the complete XPath pointing at the element's attribute, we can use:

    path_to_element = root.getroottree().getpath(element)
    attribute_name = attribute.attrname
    
    complete_path = path_to_element + "/@" + attribute_name
    print(complete_path)
    

    FYI, _ElementUnicodeResult also tells you whether the result is actually an attribute via the .is_attribute property, since the same class also represents text nodes (.is_text) and tails (.is_tail).
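
The steps above can be pulled together into one runnable sketch. The inline SAMPLE markup and the helper name xpath_of_result are assumptions made for illustration; they are not part of the original question or of lxml. The sketch uses re:test, which returns a boolean, rather than re:match, and also shows one possible way to get _Element objects back directly by putting the regex in a predicate on the element.

```python
from lxml import html

# EXSLT regular-expressions namespace, bound to the "re" prefix used in the queries.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample markup standing in for 'somecode.html'.
SAMPLE = """
<html><body>
  <div><time datetime="2019-04-15">Monday</time></div>
  <p>Published 2019-04-15 by staff</p>
</body></html>
"""


def xpath_of_result(tree, result):
    """Build an XPath for a smart-string result (hypothetical helper)."""
    parent = result.getparent()
    if parent is None:
        # Strings computed by XPath functions have no parent element.
        return None
    base = tree.getpath(parent)
    if result.is_attribute:
        return base + "/@" + result.attrname
    if result.is_text:
        # .text is the first text node inside the parent element.
        return base + "/text()[1]"
    # Otherwise it is a tail: the text node right after the parent's end tag.
    return base + "/following-sibling::text()[1]"


root = html.fromstring(SAMPLE)
tree = root.getroottree()
pattern = "2019-04-15"

# Attribute values and text nodes that match the regex.
attrs = root.xpath('//@*[re:test(., "%s")]' % pattern, namespaces=EXSLT_RE)
texts = root.xpath('//text()[re:test(., "%s")]' % pattern, namespaces=EXSLT_RE)
attr_paths = [xpath_of_result(tree, a) for a in attrs]
text_paths = [xpath_of_result(tree, t) for t in texts]

# Elements that own a matching attribute, returned as _Element objects.
elements = root.xpath('//*[@*[re:test(., "%s")]]' % pattern, namespaces=EXSLT_RE)
element_paths = [tree.getpath(el) for el in elements]

print(attr_paths)
print(text_paths)
print(element_paths)
```

Selecting the owning elements directly is convenient when only the element path matters; going through the smart string is what keeps the attribute name available.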