Grouping by id and counting duplicates with XSLT


I have the source XML below, which I'm able to group by id while counting the duplicates:

<?xml version="1.0" encoding="utf-8"?>
<cases>
    <case id="1" cont="">
        <serial>111</serial>        
    </case>
    <case id="1" cont="">
        <serial>111</serial>
    </case>
    <case id="2" cont="">
        <serial>222</serial>
    </case>
</cases>

XSLT 1.0

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:key name="caseKey" match="case" use="@id"/>

<xsl:template match="cases">
    <output>
        <xsl:apply-templates select="@*|case[generate-id()=generate-id(key('caseKey', @id)[1])]"/>
    </output>
</xsl:template>

<xsl:template match="case">
    <xsl:element name="id">
        <xsl:attribute name="val"><xsl:value-of select="@id"></xsl:value-of></xsl:attribute>
        <xsl:element name="duplicates">
            <xsl:value-of select="count(key('caseKey', @id))-1"></xsl:value-of>
        </xsl:element>
    </xsl:element>      
</xsl:template>
</xsl:stylesheet>

XSLT 2.0

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/cases">
    <output>
        <xsl:for-each-group select="case" group-by="@id">
            <xsl:element name="id">
                <xsl:attribute name="val"><xsl:value-of select="@id"></xsl:value-of></xsl:attribute>
                <xsl:element name="duplicates">
                    <xsl:value-of select="count(current-group())-1"></xsl:value-of>
                </xsl:element>
            </xsl:element> 
        </xsl:for-each-group>
    </output>
</xsl:template>

</xsl:stylesheet>

Output:

<?xml version="1.0" encoding="UTF-8"?>
<output>
   <id val="1">
      <duplicates>1</duplicates>
   </id>
   <id val="2">
      <duplicates>0</duplicates>
   </id>
</output>

Now my challenge is that one case can continue in another case. When that happens, the cont attribute has values like 1 | 2 and 2 | 2, which make that case unique. So far I haven't taken the cont attribute into consideration for the key, but now I think I have to:

<?xml version="1.0" encoding="utf-8"?>
<cases>
    <case id="1" cont="1 | 2">
        <serial>111</serial>        
    </case>
    <case id="1" cont="2 | 2">
        <serial>111</serial>
    </case>
    <case id="2" cont="">
        <serial>222</serial>
    </case>
    <case id="3" cont="">
        <serial>333</serial>
    </case>
    <case id="3" cont="">
        <serial>333</serial>
    </case>
    <case id="1" cont="1 | 2">
        <serial>111</serial>        
    </case>
    <case id="1" cont="2 | 2">
        <serial>111</serial>
    </case>
    <case id="4" cont="1 | 2">
        <serial>444</serial>        
    </case>
    <case id="4" cont="2 | 2">
        <serial>444</serial>
    </case>
</cases>

For the sample XML above, the expected output is:

<?xml version="1.0" encoding="UTF-8"?>
<output>
   <id val="1">
      <duplicates>1</duplicates>
   </id>
   <id val="2">
      <duplicates>0</duplicates>
   </id>
   <id val="3">
      <duplicates>1</duplicates>
   </id>
   <id val="4">
      <duplicates>0</duplicates>
   </id>
</output>

Explanation:

  • A case will be considered duplicated if the same id is present in multiple cases but cont is empty (ref: case id=3)
  • A case will be considered unique if the same id is present in multiple cases but cont is not empty (ex: 1 | 2, 2 | 2) (ref: case id=4)
  • A case will be considered duplicated if the same id along with cont values are present in multiple cases (ref: case id=1)

Further explanation on duplicates:

The case below is a duplicate because the same id appears twice and cont is blank:

<case id="1" cont="">
    <serial>111</serial>        
</case>
<case id="1" cont="">
    <serial>111</serial>
</case>

<output>
   <id val="1">
      <duplicates>1</duplicates>
   </id>
</output>

The case below is not a duplicate by itself, because one id can span multiple pages/cases; in that situation each entry carries the same id together with a cont value:

<case id="1" cont="1 | 2">
    <serial>111</serial>        
</case>
<case id="1" cont="2 | 2">
    <serial>111</serial>
</case>

<output>
   <id val="1">
      <duplicates>0</duplicates>
   </id>
</output>

The above is considered unique. It can still become a duplicate if the same block appears again, as in the example below:

<case id="1" cont="1 | 2">
    <serial>111</serial>        
</case>
<case id="1" cont="2 | 2">
    <serial>111</serial>
</case>
<case id="1" cont="1 | 2">
    <serial>111</serial>        
</case>
<case id="1" cont="2 | 2">
    <serial>111</serial>
</case>

<output>
   <id val="1">
      <duplicates>1</duplicates>
   </id>
</output> 

In the scenario above, even though there are two <case id="1" cont="1 | 2"> and two <case id="1" cont="2 | 2">, the final duplicate count is not two, because it is the same case id split in two. See the example below:

(Case id=1 split across 3 pages: the 3 entries below, taken as one block, count as a single case)
<case id="1" cont="1 | 3">
    <serial>111</serial>        
</case>
<case id="1" cont="2 | 3">
    <serial>111</serial>
</case>
<case id="1" cont="3 | 3">
    <serial>111</serial>
</case>

(Duplicated case id=1, same as above: this entire block is the one that counts as the duplicate)
<case id="1" cont="1 | 3">
    <serial>111</serial>        
</case>
<case id="1" cont="2 | 3">
    <serial>111</serial>
</case>
<case id="1" cont="3 | 3">
    <serial>111</serial>
</case>

<output>
   <id val="1">
      <duplicates>1</duplicates>
   </id>
</output>

How can I achieve this in either XSLT 1.0 or XSLT 2.0?


Solution

  • Assuming you have consistent input data (meaning cont is either '' or there are consistent sequences 1 | n, 2 | n, ..., n | n), I think it suffices to group only the cases where cont is empty and the ones where cont is n | n, since each complete block contributes exactly one such entry; with XSLT 3 that translates into e.g.

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="#all"
        expand-text="yes"
        version="3.0">
    
      <xsl:mode on-no-match="shallow-skip"/>
    
      <xsl:output method="xml" indent="yes" />
    
      <xsl:template match="cases">
        <output>
          <xsl:for-each-group 
              select="case[@cont = '' or count(distinct-values(tokenize(@cont, '\s*\|\s*'))) = 1]" 
              composite="yes" 
              group-by="if (@cont = '') then (@id, '') else (@id, tokenize(@cont, '\s*\|\s*')[2])">
            <id val="{@id}">
              <duplicates>{count(current-group()) - 1}</duplicates>
            </id>
          </xsl:for-each-group>
        </output>
      </xsl:template>
      
    </xsl:stylesheet>
    

    In XSLT 2:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="#all"
        version="2.0">
       
      <xsl:output method="xml" indent="yes" />
    
      <xsl:template match="cases">
        <output>
          <xsl:for-each-group 
              select="case[@cont = '' or count(distinct-values(tokenize(@cont, '\s*\|\s*'))) = 1]" 
              group-by="@id">
            <xsl:for-each-group select="current-group()" group-by="if (@cont = '') then '' else tokenize(@cont, '\s*\|\s*')[2]">
              <id val="{@id}">
                <duplicates>
                  <xsl:value-of select="count(current-group()) - 1"/>
                </duplicates>
              </id>
            </xsl:for-each-group>
          </xsl:for-each-group>
        </output>
      </xsl:template>
      
    </xsl:stylesheet>
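
    The question also asks about XSLT 1.0. The same idea can probably be expressed there with a key and Muenchian grouping; the following is a minimal sketch under the same consistent-input assumption, not a definitive implementation. The key name lastPage and the '#' separator are my own illustrative choices, and substring-before/substring-after stand in for tokenize, which XSLT 1.0 does not have.

    ```xml
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" indent="yes"/>

      <!-- Assumed names: the key "lastPage" and the '#' separator are illustrative.
           Only cases whose cont is empty or of the form n | n are indexed;
           for an empty cont both substrings are '' and therefore compare equal. -->
      <xsl:key name="lastPage"
               match="case[normalize-space(substring-before(@cont, '|'))
                           = normalize-space(substring-after(@cont, '|'))]"
               use="concat(@id, '#', normalize-space(substring-after(@cont, '|')))"/>

      <xsl:template match="/cases">
        <output>
          <!-- Muenchian grouping: keep the first indexed case of each id/n pair.
               Cases such as 1 | 2 are not in the key, so they never match the first node. -->
          <xsl:for-each select="case[generate-id() = generate-id(key('lastPage',
                                     concat(@id, '#', normalize-space(substring-after(@cont, '|'))))[1])]">
            <id val="{@id}">
              <duplicates>
                <xsl:value-of select="count(key('lastPage',
                                      concat(@id, '#', normalize-space(substring-after(@cont, '|'))))) - 1"/>
              </duplicates>
            </id>
          </xsl:for-each>
        </output>
      </xsl:template>
    </xsl:stylesheet>
    ```

    For the second sample input in the question, this should report 1, 0, 1 and 0 duplicates for ids 1 to 4. As with the versions above, an id that occurs both with an empty cont and with n | n entries, or with different values of n, would produce more than one id element, so it relies on the input being consistent.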