
xpath find link containing HTML in page


This is not the same question as "xpath find specific link in page". I have <a href="http://example.com">foo <em class="bar">baz</em>.</a>. and need to find the link by its full content, foo <em class="bar">baz</em>., including the closing dot.


Solution

  • Note: I'm following up on OP's comment

    A (visually) simpler variation of OP's own answer could be:

    //a[. = "foo baz."][em[@class = "bar"] = "baz"]
    

    or even:

    //a[.="foo baz." and em[@class="bar"]="baz"]
    

    (assuming you want to select the <a> node, and not the child <em>)
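
    As a quick check, both forms can be run against the markup from the question. This is a minimal sketch that assumes lxml is installed and wraps the snippet in a <p> element so it parses as well-formed XML; neither choice comes from the original answer.

```python
from lxml import etree

# The question's markup, wrapped in <p> (an assumption) so it has a single root.
root = etree.fromstring(
    '<p><a href="http://example.com">foo <em class="bar">baz</em>.</a>.</p>'
)

# Two chained predicates versus one predicate joined with "and".
chained = root.xpath('//a[. = "foo baz."][em[@class = "bar"] = "baz"]')
combined = root.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]')

# Both forms select the <a> element itself, not its <em> child.
print([el.tag for el in chained])
print([el.tag for el in combined])
```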

    Regarding OP's question:

    why the [em[]= doesn't need the dot?

    Inside a predicate, comparing a node-set with = against a string on the right compares the string-value of each node on the left, here the <em> children, i.e. what string() would return for each of them. The test is true as soon as one of those string-values equals the string.

    The XPath 1.0 specification has an example of this:

    chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction"

    Later, in its section on booleans, the same spec says:

    If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
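
    The rule quoted above can be mimicked in a few lines of plain Python. This is only a sketch using the standard library's xml.etree.ElementTree; the helper names string_value and nodeset_equals are made up for illustration and are not part of any XPath API.

```python
import xml.etree.ElementTree as ET


def string_value(node):
    """Concatenate all descendant text, like XPath's string() on an element."""
    return "".join(node.itertext())


def nodeset_equals(nodes, text):
    """XPath 1.0 rule: true if at least one node's string-value equals text."""
    return any(string_value(n) == text for n in nodes)


a = ET.fromstring('<a href="http://example.com">foo <em class="bar">baz</em>.</a>')
ems = a.findall("em[@class='bar']")  # the node-set on the left of "="

print(string_value(a))             # foo baz.
print(nodeset_equals(ems, "baz"))  # True
print(nodeset_equals([], "baz"))   # False: an empty node-set never matches
```

    This is why em[@class="bar"]="baz" needs no dot: the node-set em[...] is itself the left operand, and each of its nodes is reduced to its string-value before comparing.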

    In OP's answer, //a[string() = 'bar baz.']/em[@class='bar' and .='baz'], the . is needed because that predicate belongs to the em step itself, so the node being tested against 'baz' is the context node.

    Note that my answer is somewhat naive and assumes there is only one <em> child of <a>: [em[@class="bar"]="baz"] only requires that some em[@class="bar"] child has the string-value "baz", not that it is the only or the first one.

    Consider this input (a second <em class="bar"> child, but empty):

    <a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.
    

    and this test using Scrapy selectors:

    >>> import scrapy
    >>> s = scrapy.Selector(text="""<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.""")
    >>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
    u'<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
    >>> 
    

    The XPath still matches, even though <a> now has a second, empty <em class="bar"> child, which may not be what you want.
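
    If that case should be rejected, one possible tightening is to also require exactly one <em> child with count(em)=1. The sketch below assumes lxml is installed and again wraps the input in a <p> element so it parses as XML; the count() condition is a suggestion, not something from OP's answer.

```python
from lxml import etree

# Input with a second, empty <em class="bar"> child, wrapped in <p> (an assumption).
root = etree.fromstring(
    '<p>'
    '<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.'
    '</p>'
)

loose = '//a[.="foo baz." and em[@class="bar"]="baz"]'
# Same test, plus a requirement that <a> has exactly one <em> child.
strict = '//a[.="foo baz." and count(em)=1 and em[@class="bar"]="baz"]'

print(len(root.xpath(loose)))   # 1: any matching <em> is enough
print(len(root.xpath(strict)))  # 0: rejected because there are two <em> children
```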