
Using XPATH, how to select ANY node that contains a certain string


Let's say I have an XML file like this one:

<books>
  <book>
    <title>John is alive</title>
    <abstract>
        A man is found alive after having disappeared for 10 years.
    </abstract>
    <description>
        <en> John disappeared 10 years ago. Lorem ipsum dolor sit amet ...</en>
        <fr> Il y a 10 ans, John disparaissait. Lorem ipsum dolor sit amet ...</fr>
    </description>
    <notes>First book in the series, where the character is introduced</notes>
  </book>
  <book>
    <title>The disappearance of John</title>
    <abstract>
        A prequel to the book "John is alive".
    </abstract>
    <description>
        <en> He led an ordinary life, but then ... lorem ipsum dolor sit amet ...</en>
        <fr> Sa vie était tout à fait ordinaire, mais ... lorem ipsum dolor sit amet ...</fr>
    </description>
    <notes>Second book in the "John" series, but first in chronological order</notes>
  </book>
</books>

My question is simple: how can I, using XPATH, get a collection of all nodes that contain the word John?

Obviously, I can specify a series of nodes and that works fine:

(//title | //abstract | //description/* | //notes)[contains(lower-case(text()),"john")]

But if my XML grows (and it will!), with new elements being added at various levels in the structure, I don't want to constantly have to go back and adjust my XPATH.

What I fail to understand is why a generic statement like

//*[contains(lower-case(text()),"john")]

fails with this error message: Required cardinality of first argument of lower-case() is one or zero.

Yet, not all statements with an asterisk fail.

For instance:

//books/book/*[contains(lower-case(text()),"john")] fails with the above error message

while

//books/book/*/*[contains(lower-case(text()),"john")] succeeds and retrieves both the <en> and <fr> nodes from the first <description> element

If it's not possible, fine, I will list all elements in my XPATH, but I still would like to get a clear understanding of the behavior of the * selector in the context of a contains() operation.


Solution

  • The terms "nodes" (see XPath difference between child::* and child::node()) and "contains" (see How to use XPath contains() for specific text?) are both ambiguous unless used precisely, but one of the following XPaths will likely meet your needs:

    1. All nodes whose string value contains the substring, "John":

      //node()[contains(.,"John")]
      
    2. All such elements:

      //*[contains(.,"John")]
      
    3. All such attributes:

      //@*[contains(.,"John")]
      
    4. All such text nodes:

      //text()[contains(.,"John")]
      
    5. All elements with text node children that contain the substring, "John":

      //*[text()[contains(.,"John")]]
      

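    Notice that #1 will include books, because its string value (the concatenation of all its descendant text) contains "John", but #5 will exclude it, because none of its own text node children do. See Testing text() nodes vs string values in XPath.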
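
The difference between #2 (string value) and #5 (text node children) can be sketched in plain Python. This is only an illustration, not an XPath engine: the standard library's ElementTree does not support contains(), so the two helper functions below (string_value and direct_text_nodes, both hypothetical names) emulate the two tests by hand on a shortened copy of the sample XML.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question (an assumption
# made to keep the example small).
XML = """<books>
  <book>
    <title>John is alive</title>
    <description>
        <en> John disappeared 10 years ago.</en>
        <fr> Il y a 10 ans, John disparaissait.</fr>
    </description>
    <notes>First book in the series</notes>
  </book>
</books>"""


def string_value(elem):
    """Emulates '.' in contains(., ...): all descendant text, concatenated."""
    return "".join(elem.itertext())


def direct_text_nodes(elem):
    """Emulates text(): only the text nodes that are children of elem."""
    nodes = []
    if elem.text:
        nodes.append(elem.text)          # text before the first child element
    for child in elem:
        if child.tail:
            nodes.append(child.tail)     # text following each child element
    return nodes


root = ET.fromstring(XML)
all_elems = list(root.iter())            # document order, root included

# Roughly //*[contains(., "John")]
by_string_value = [e.tag for e in all_elems if "John" in string_value(e)]

# Roughly //*[text()[contains(., "John")]]
by_text_child = [
    e.tag for e in all_elems
    if any("John" in t for t in direct_text_nodes(e))
]

print(by_string_value)
print(by_text_child)
```

The first list includes the ancestors books, book and description because their string values inherit the text of their descendants; the second list keeps only the elements that directly hold the matching text.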

    You can replace contains(.,"John") with contains(lower-case(.),"john") in any of the above XPaths if you're using XPath 2.0. See also Case insensitive XPath contains() possible?
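
As for the cardinality error in the question: in XPath 2.0, lower-case() accepts at most one item, while text() inside the predicate selects every text node child of the element being tested. An element that contains child elements usually also has several text node children (the indentation whitespace between those children), so lower-case(text()) receives a sequence of more than one item and fails. Leaf elements such as <en> and <fr> have exactly one text node, which is why //books/book/*/* works. The sketch below counts text node children per element with the standard library's ElementTree; it assumes the parser keeps whitespace-only text nodes (ElementTree does, other tools may be configured to strip them), and count_text_children is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of one <book> from the question (an assumption made to
# keep the example small).
XML = """<book>
  <title>John is alive</title>
  <description>
    <en> John disappeared 10 years ago.</en>
    <fr> Il y a 10 ans, John disparaissait.</fr>
  </description>
</book>"""


def count_text_children(elem):
    """Number of text nodes that text() would select for this element."""
    count = 1 if elem.text else 0                      # text before first child
    count += sum(1 for child in elem if child.tail)    # text after each child
    return count


root = ET.fromstring(XML)
counts = {e.tag: count_text_children(e) for e in root.iter()}

for tag, n in counts.items():
    verdict = "ok for lower-case(text())" if n <= 1 else "too many items"
    print(f"{tag}: {n} text node(s) -> {verdict}")
```

Moving the test inside a predicate on each text node, as in //*[text()[contains(lower-case(.),"john")]], avoids the problem because lower-case() is then applied to one text node at a time.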