Search code examples
xmlxpathaxessiblings

XPATH: find all elements with same value until the value changes


Here's a sample XML:

<?xml version="1.0" ?>
<someparent>
    <somechild>
        <description>I want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I don't want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I want this too</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Nope, not that one</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Not that one either</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Yep, I want this</description>
        <id>41</id>
    </somechild>
</someparent>

The <id> elements are always grouped: all elements with the same <id> value follow each other in the document. I may have thousands of different <id>s in a single file. What I want is to find each <somechild> element that is the first occurrence of its corresponding <id> group. So my expected result would be:

    <somechild>
        <description>I want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I want this too</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Yep, I want this</description>
        <id>41</id>
    </somechild>

I need a single XPATH command to select all of these "first items in a group". I have tried various combinations of following-sibling and preceding-sibling axes, but I can't get it just quite right. I have come very close to what I want to achieve with the following statement:

//someparent/somechild/id[text()=parent::somechild/preceding-sibling::somechild/id[text()]]/parent::somechild

This actually returns all the nodes I don't want, as it selects all the items that are not the first in their group (so it's essentially a perfect negative of what I want!). But for the life of me, I haven't been able to figure out how to reverse the results.

Any help woud be kindly appreciated.


Solution

  • This O(n2) XPath 1.0 expression,

    //someparent/somechild[not(id = preceding-sibling::somechild/id)]
    

    will select all somechild elements that have no preceding siblings with the same id child element,

       <somechild>
            <description>I want this</description>
            <id>98</id>
        </somechild>
        <somechild>
            <description>I want this too</description>
            <id>2</id>
        </somechild>
        <somechild>
            <description>Yep, I want this</description>
            <id>41</id>
        </somechild>
    

    as requested.


    Update

    Michael Kay noted helpfully that the above XPath has an algorithmic complexity of O(n2), because for each child sibling, all preceding siblings are compared. This won't matter for small numbers of siblings, but OP mentioned thousands, so the size issue becomes a concern.

    See his XPath 3.1 solution, which is a much better O(n).

    He further observed that an O(n) XPath 1.0 expression is possible as long as only immediately preceding siblings have to be checked:

    //someparent/somechild[not(id = preceding-sibling::somechild[1]/id)]
                                                                ^^^
    

    This lower complexity XPath will yield the same results for OP's sample case.

    A differentiating case would involve later siblings with id values that repeat earlier clusters of id values. For example, adding another cluster of id siblings with 98 values:

    <someparent>
      <somechild>
        <description>I want this</description>
        <id>98</id>
      </somechild>
      <somechild>
        <description>I don't want this</description>
        <id>98</id>
      </somechild>
      <somechild>
        <description>I want this too</description>
        <id>2</id>
      </somechild>
      <somechild>
        <description>Nope, not that one</description>
        <id>2</id>
      </somechild>
      <somechild>
        <description>Not that one either</description>
        <id>2</id>
      </somechild>
      <somechild>
        <description>Yep, I want this</description>
        <id>41</id>
      </somechild>
      <somechild>
        <description>REPEAT CASE 1</description>
        <id>98</id>
      </somechild>  
      <somechild>
        <description>REPEAT CASE 2</description>
        <id>98</id>
      </somechild>
    </someparent>
    

    The difference is that the O(n) XPath will not include the REPEAT CASE 1 somechild element, but the O(n2) XPath will include the distantly repeated REPEAT CASE 1:

    <somechild>
        <description>I want this</description>
        <id>98</id>
    </somechild>
    <somechild>
        <description>I want this too</description>
        <id>2</id>
    </somechild>
    <somechild>
        <description>Yep, I want this</description>
        <id>41</id>
    </somechild>
    <somechild>
      <description>REPEAT CASE 1</description>
      <id>98</id>
    </somechild>
    

    As long as the requirements do not require non-immediate comparisons, use the more efficient O(n) XPath.