Let's say I have an XML file like this one:
<books>
<book>
<title>John is alive</title>
<abstract>
A man is found alive after having disappeared for 10 years.
</abstract>
<description>
<en> John disappeared 10 years ago. Lorem ipsum dolor sit amet ...</en>
<fr> Il y a 10 ans, John disparaissait. Lorem ipsum dolor sit amet ...</fr>
</description>
<notes>First book in the series, where the character is introduced</notes>
</book>
<book>
<title>The disappearance of John</title>
<abstract>
A prequel to the book "John is alive".
</abstract>
<description>
<en> He led an ordinary life, but then ... lorem ipsum dolor sit amet ...</en>
<fr> Sa vie était tout à fait ordinaire, mais ... lorem ipsum dolor sit amet ...</fr>
</description>
<notes>Second book in the "John" series, but first in chronological order</notes>
</book>
</books>
My question is simple: how can I, using XPath, get a collection of all nodes that contain the word John?
Obviously, I can specify a series of nodes and that works fine:
(//title | //abstract | //description/* | //notes)[contains(lower-case(text()),"john")]
But if my XML grows (and it will!), with new elements being added at various levels in the structure, I don't want to constantly have to go back and adjust my XPATH.
What I fail to understand is why a generic statement like
//*[contains(lower-case(text()),"john")]
fails with this error message: "Required cardinality of first argument of lower-case() is one or zero".
Yet, not all statements with an asterisk fail.
For instance:
//books/book/*[contains(lower-case(text()),"john")]
fails with the above error message, while
//books/book/*/*[contains(lower-case(text()),"john")]
succeeds and retrieves both the <en> and <fr> nodes from the first <description> element.
If it's not possible, fine, I will list all elements in my XPath, but I still would like to get a clear understanding of the behavior of the * selector in the context of a contains() operation.
The terms nodes (see XPath difference between child::* and child::node()) and contains (see How to use XPath contains() for specific text?) are both ambiguous when used loosely, but one of the following XPaths will likely meet your needs:
1. All nodes whose string value contains the substring, "John":
//node()[contains(.,"John")]
2. All such elements:
//*[contains(.,"John")]
3. All such attributes:
//@*[contains(.,"John")]
4. All such text nodes:
//text()[contains(.,"John")]
5. All elements with text node children that contain the substring, "John":
//*[text()[contains(.,"John")]]
Notice that #1 will include books, but #5 will exclude it. See Testing text() nodes vs string values in XPath.
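To make the difference between testing the string value and testing text node children concrete, here is a rough Python sketch. It does not run XPath at all; it only emulates the two predicates with the standard library's xml.etree.ElementTree, and the helper names (string_value, text_node_children, and so on) and the shortened sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the question's document.
XML = """<books>
<book>
<title>John is alive</title>
<notes>First book in the series</notes>
</book>
</books>"""


def string_value(elem):
    """Approximate the XPath string value: all descendant text, concatenated."""
    return "".join(elem.itertext())


def text_node_children(elem):
    """Approximate text(): the element's own text plus the tail of each child."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]


def elements_by_string_value(root, needle):
    # Roughly //*[contains(., needle)]
    return [e.tag for e in root.iter() if needle in string_value(e)]


def elements_by_own_text(root, needle):
    # Roughly //*[text()[contains(., needle)]]
    return [e.tag for e in root.iter()
            if any(needle in t for t in text_node_children(e))]


root = ET.fromstring(XML)
by_value = elements_by_string_value(root, "John")
by_own_text = elements_by_own_text(root, "John")

print(by_value)     # ['books', 'book', 'title']
print(by_own_text)  # ['title']
```

The ancestors books and book match on string value because their string value includes the title's text, but their own text node children are only whitespace, so the second test skips them.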
You can replace contains(.,"John") with contains(lower-case(.),"john") in any of the above XPaths if you're using XPath 2.0. See also Case insensitive XPath contains() possible?
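As for the cardinality error in the question: lower-case() in XPath 2.0 accepts at most one item, while text() selects every text node child of the context element. The context item . is always a single node, so lower-case(.) is safe. The sketch below is only an illustration in Python with xml.etree.ElementTree, not an XPath 2.0 engine. It counts text node children per element for a shortened, hypothetical sample, and it assumes the XPath processor keeps whitespace-only text nodes, which the reported error suggests.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample; line breaks between tags are kept on purpose.
XML = """<book>
<title>John is alive</title>
<description>
<en>John disappeared 10 years ago.</en>
<fr>Il y a 10 ans, John disparaissait.</fr>
</description>
</book>"""


def text_node_count(elem):
    """Count text node children, whitespace-only ones included."""
    parts = [elem.text] + [child.tail for child in elem]
    return sum(1 for p in parts if p)


root = ET.fromstring(XML)
counts = {e.tag: text_node_count(e) for e in root.iter()}

for tag, n in counts.items():
    verdict = "ok" if n <= 1 else "too many items for lower-case(text())"
    print(f"{tag}: {n} text node(s), {verdict}")
```

Under that assumption, description has several whitespace-only text nodes, so //books/book/* reaches an element where text() yields more than one item and the call fails, whereas every element selected by //books/book/*/* (en and fr) has exactly one text node.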