xpath find link containing HTML in page

This is not the same question as "xpath find specific link in page". I've got <a href="http://example.com">foo <em class="bar">baz</em>.</a> and need to find the link by its full text, foo baz., including the closing dot.

Solution

Note: I'm following up on OP's comment.

A (visually) simpler variation of OP's own answer could be:

//a[. = "foo baz."][em[@class = "bar"] = "baz"]

or even:

//a[.="foo baz." and em[@class="bar"]="baz"]

(assuming you want to select the <a> node, and not the child <em>)

Regarding OP's question:

why the [em[]= doesn't need the dot?

Inside a predicate, comparing with = against a string on the right converts the left operand to a string. Here, the <em> node is compared by its string-value, i.e. what string() would return for that node.

The XPath 1.0 specification has an example of this:

chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction"

Later, the same spec says on boolean tests:

If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
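
The rule quoted above can be emulated in plain Python to see why no . is needed in front of em[...]. This is only a sketch using the standard library's xml.etree.ElementTree, not a real XPath engine; the helper names string_value and nodeset_equals are made up for illustration.

```python
import xml.etree.ElementTree as ET


def string_value(node):
    # XPath string-value of an element: all descendant text, in document order.
    return "".join(node.itertext())


def nodeset_equals(nodes, text):
    # "node-set = string" is true if ANY node's string-value equals the string.
    return any(string_value(n) == text for n in nodes)


a = ET.fromstring('<a href="http://example.com">foo <em class="bar">baz</em>.</a>')

# Emulates the node-set em[@class="bar"], evaluated with <a> as context node.
ems = [em for em in a.findall("em") if em.get("class") == "bar"]

print(string_value(a))             # foo baz.
print(nodeset_equals(ems, "baz"))  # True
```

The left-hand side em[@class="bar"] is already a node-set relative to <a>, so the comparison walks its nodes and uses each one's string-value; a . would only be needed to refer to the context node itself.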

In OP's answer, //a[string() = 'bar baz.']/em[@class='bar' and .='baz'], the . is needed because the predicate is attached to the em step itself, so the test on 'baz' is on the context node (the <em>), not on one of its children.

Note that my answer is somewhat naive and assumes there's only one <em> child of <a>, because [em[@class="bar"]="baz"] only checks that at least one em[@class="bar"] child satisfies the string-value condition, not that it is the only or the first one.

Consider this input (a second <em> child, but empty):

<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.

and this test using Scrapy selectors:

>>> import scrapy
>>> s = scrapy.Selector(text="""<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.""")
>>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
u'<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
>>>

The XPath still matches, even though the second <em class="bar"> is empty, and you may not want this.
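
If the extra empty <em> should disqualify the link, one possible tightening (an assumption about what you want, not something OP asked for) is to also require exactly one <em> child, for instance by adding count(em)=1 to the predicate. The sketch below emulates both the original and the stricter check with the standard library's xml.etree.ElementTree rather than Scrapy; loose_match and strict_match are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def string_value(node):
    # XPath string-value of an element: all descendant text, in document order.
    return "".join(node.itertext())


def loose_match(a):
    # Emulates: a[.="foo baz." and em[@class="bar"]="baz"]
    return string_value(a) == "foo baz." and any(
        em.get("class") == "bar" and string_value(em) == "baz"
        for em in a.findall("em")
    )


def strict_match(a):
    # Emulates: a[.="foo baz." and count(em)=1 and em[@class="bar"]="baz"]
    return loose_match(a) and len(a.findall("em")) == 1


one_em = ET.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em>.</a>'
)
two_ems = ET.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
)

print(loose_match(two_ems), strict_match(two_ems))  # True False
print(loose_match(one_em), strict_match(one_em))    # True True
```

Whether counting children is the right constraint depends on the markup you expect; testing only the first child, e.g. with em[1], would be another option.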