This is not the same question as "xpath find specific link in page".
I've got <a href="http://example.com">foo <em class="bar">baz</em>.</a> and need to find the link by the full foo <em class="bar">baz</em>. content, including the closing dot.
Note: I'm following up on OP's comment
A (visually) simpler variation of OP's own answer could be:
//a[. = "foo baz."][em[@class = "bar"] = "baz"]
or even:
//a[.="foo baz." and em[@class="bar"]="baz"]
(assuming you want to select the <a> node, and not the child <em>)
Regarding OP's question:
why the [em[]= doesn't need the dot?
Inside a predicate, comparing a node-set with = against a string on the right compares the string-value of each node on the left, here what string() would return for <em>, so no explicit . is needed.
The XPath 1.0 specification has an example of this:
chapter[title="Introduction"]
selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction"
Later, the same spec says on boolean tests:
If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
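To make that rule concrete, here is a minimal sketch that emulates it in plain Python with the standard library's xml.etree.ElementTree instead of a real XPath engine. The helper names string_value and nodeset_equals are made up for this illustration, and joining itertext() is only assumed to be a close stand-in for XPath's string() on an element.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question (well-formed, so an XML parser accepts it).
a = ET.fromstring('<a href="http://example.com">foo <em class="bar">baz</em>.</a>')

def string_value(node):
    # Hypothetical helper: concatenate all descendant text in document order,
    # roughly what XPath's string() returns for an element node.
    return "".join(node.itertext())

def nodeset_equals(nodes, text):
    # Hypothetical helper mirroring "node-set = string" in XPath 1.0:
    # true if ANY node in the set has a string-value equal to the string.
    return any(string_value(n) == text for n in nodes)

ems = a.findall("em[@class='bar']")      # plays the role of em[@class="bar"]

print(string_value(a))                   # foo baz.
print(nodeset_equals(ems, "baz"))        # True
print(nodeset_equals([a], "foo baz."))   # True, the '. = "foo baz."' case
```

The comparison is existential: one matching node is enough, and an empty node-set never compares equal.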
In OP's answer, //a[string() = 'bar baz.']/em[@class='bar' and .='baz'], the . is needed since the test on 'baz' is on the context node.
Note that my answer is somewhat naive and assumes there's only one <em> child of <a>, because [em[@class="bar"]="baz"] is looking for one em[@class="bar"] matching the string-value condition, not that it's the only or first one.
Consider this input (a second <em class="bar"> child, but empty):
<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.
and this test using Scrapy selectors:
>>> import scrapy
>>> s = scrapy.Selector(text="""<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.""")
>>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
u'<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
>>>
The XPath still matches even though <a> now has a second <em class="bar"> child, which you may not want.
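If the extra child should be rejected, one option is to also constrain the number of <em> children, for instance by adding count(em)=1 to the predicate. The sketch below shows the same idea in plain Python with xml.etree.ElementTree, whose XPath subset is too limited for the full expression. strict_match is a hypothetical helper, not part of Scrapy or any library.

```python
import xml.etree.ElementTree as ET

def strict_match(a):
    # Hypothetical helper: require the full text AND exactly one <em> child,
    # which must carry class="bar" and have the string-value "baz".
    ems = a.findall("em")
    return (
        "".join(a.itertext()) == "foo baz."
        and len(ems) == 1
        and ems[0].get("class") == "bar"
        and "".join(ems[0].itertext()) == "baz"
    )

single = ET.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em>.</a>'
)
double = ET.fromstring(
    '<a href="http://example.com">foo <em class="bar">baz</em>'
    '<em class="bar"></em>.</a>'
)

print(strict_match(single))  # True
print(strict_match(double))  # False: the second, empty <em> is rejected
```

Whether to be this strict depends on the pages you are scraping; the looser predicate is fine when a second <em> cannot occur.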