I have the source XML below, which I am able to group by id to count the duplicates:
<?xml version="1.0" encoding="utf-8"?>
<cases>
<case id="1" cont="">
<serial>111</serial>
</case>
<case id="1" cont="">
<serial>111</serial>
</case>
<case id="2" cont="">
<serial>222</serial>
</case>
</cases>
XSLT 1.0
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="caseKey" match="case" use="@id"/>
<xsl:template match="cases">
<output>
<xsl:apply-templates select="@*|case[generate-id()=generate-id(key('caseKey', @id)[1])]"/>
</output>
</xsl:template>
<xsl:template match="case">
<xsl:element name="id">
<xsl:attribute name="val"><xsl:value-of select="@id"></xsl:value-of></xsl:attribute>
<xsl:element name="duplicates">
<xsl:value-of select="count(key('caseKey', @id))-1"></xsl:value-of>
</xsl:element>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
XSLT 2.0
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/cases">
<output>
<xsl:for-each-group select="case" group-by="@id">
<xsl:element name="id">
<xsl:attribute name="val"><xsl:value-of select="@id"></xsl:value-of></xsl:attribute>
<xsl:element name="duplicates">
<xsl:value-of select="count(current-group())-1"></xsl:value-of>
</xsl:element>
</xsl:element>
</xsl:for-each-group>
</output>
</xsl:template>
</xsl:stylesheet>
Output:
<?xml version="1.0" encoding="UTF-8"?>
<output>
<id val="1">
<duplicates>1</duplicates>
</id>
<id val="2">
<duplicates>0</duplicates>
</id>
</output>
Now, my challenge is that one case can continue in another case. When it does, the cont attribute has values like 1 | 2 and 2 | 2, which make that case unique. So far I haven't taken the cont attribute into consideration for the key, but now I think I have to:
<?xml version="1.0" encoding="utf-8"?>
<cases>
<case id="1" cont="1 | 2">
<serial>111</serial>
</case>
<case id="1" cont="2 | 2">
<serial>111</serial>
</case>
<case id="2" cont="">
<serial>222</serial>
</case>
<case id="3" cont="">
<serial>333</serial>
</case>
<case id="3" cont="">
<serial>333</serial>
</case>
<case id="1" cont="1 | 2">
<serial>111</serial>
</case>
<case id="1" cont="2 | 2">
<serial>111</serial>
</case>
<case id="4" cont="1 | 2">
<serial>444</serial>
</case>
<case id="4" cont="2 | 2">
<serial>444</serial>
</case>
</cases>
For the above sample XML, the expected output is:
<?xml version="1.0" encoding="UTF-8"?>
<output>
<id val="1">
<duplicates>1</duplicates>
</id>
<id val="2">
<duplicates>0</duplicates>
</id>
<id val="3">
<duplicates>1</duplicates>
</id>
<id val="4">
<duplicates>0</duplicates>
</id>
</output>
Explanation:
- id is present in multiple cases but cont is empty (ref: case id=3)
- id is present in multiple cases but cont is not empty (ex: 1 | 2, 2 | 2) (ref: case id=4)
- id along with cont values are present in multiple cases (ref: case id=1)

Further explanation on duplicates:
The block below is a duplicate because the same id appears twice and cont is blank:
<case id="1" cont="">
<serial>111</serial>
</case>
<case id="1" cont="">
<serial>111</serial>
</case>
<output>
<id val="1">
<duplicates>1</duplicates>
</id>
</output>
The block below is not itself a duplicate, because the same id can span multiple pages/cases; for that, the same id has to be present together with a cont value:
<case id="1" cont="1 | 2">
<serial>111</serial>
</case>
<case id="1" cont="2 | 2">
<serial>111</serial>
</case>
<output>
<id val="1">
<duplicates>0</duplicates>
</id>
</output>
The above is considered unique. It can still be duplicated if the same block appears again, as in the example below:
<case id="1" cont="1 | 2">
<serial>111</serial>
</case>
<case id="1" cont="2 | 2">
<serial>111</serial>
</case>
<case id="1" cont="1 | 2">
<serial>111</serial>
</case>
<case id="1" cont="2 | 2">
<serial>111</serial>
</case>
<output>
<id val="1">
<duplicates>1</duplicates>
</id>
</output>
In the above scenario, even though there are two <case id="1" cont="1 | 2"> and two <case id="1" cont="2 | 2">, the count of duplicates at the end is not two, because it is the same case id split in two. See the example below:
(Case id=1 split across 3 pages: the 3 entries below are considered a single case, the entire block)
<case id="1" cont="1 | 3">
<serial>111</serial>
</case>
<case id="1" cont="2 | 3">
<serial>111</serial>
</case>
<case id="1" cont="3 | 3">
<serial>111</serial>
</case>
(Duplicated case id=1, same as above: this entire block is the one that counts as the duplicate)
<case id="1" cont="1 | 3">
<serial>111</serial>
</case>
<case id="1" cont="2 | 3">
<serial>111</serial>
</case>
<case id="1" cont="3 | 3">
<serial>111</serial>
</case>
<output>
<id val="1">
<duplicates>1</duplicates>
</id>
</output>
How can I achieve this in either XSLT 1.0 or XSLT 2.0?
Assuming you have consistent input data (meaning cont is either '' or there are consistent sequences of 1 | n, 2 | n, .., n | n), I would think it suffices to group only the case elements where cont is empty and the ones where it is n | n; with XSLT 3 that translates into e.g.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
expand-text="yes"
version="3.0">
<xsl:mode on-no-match="shallow-skip"/>
<xsl:output method="xml" indent="yes" />
<xsl:template match="cases">
<output>
<xsl:for-each-group
select="case[@cont = '' or count(distinct-values(tokenize(@cont, '\s*\|\s*'))) = 1]"
composite="yes"
group-by="if (@cont = '') then (@id, '') else (@id, tokenize(@cont, '\s*\|\s*')[2])">
<id val="{@id}">
<duplicates>{count(current-group()) - 1}</duplicates>
</id>
</xsl:for-each-group>
</output>
</xsl:template>
</xsl:stylesheet>
In XSLT 2:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
version="2.0">
<xsl:output method="xml" indent="yes" />
<xsl:template match="cases">
<output>
<xsl:for-each-group
select="case[@cont = '' or count(distinct-values(tokenize(@cont, '\s*\|\s*'))) = 1]"
group-by="@id">
<xsl:for-each-group select="current-group()" group-by="if (@cont = '') then '' else tokenize(@cont, '\s*\|\s*')[2]">
<id val="{@id}">
<duplicates>
<xsl:value-of select="count(current-group()) - 1"/>
</duplicates>
</id>
</xsl:for-each-group>
</xsl:for-each-group>
</output>
</xsl:template>
</xsl:stylesheet>
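Since the question also allows XSLT 1.0, the same idea can probably be sketched with Muenchian grouping. This is only a sketch under the same consistency assumption as above (cont is either empty or part of a complete 1 | n .. n | n sequence). The key name lastPageKey is made up for illustration, and unlike the stylesheets above it keys on @id alone, so it emits one element per id:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Hypothetical key: index only the case that closes a block, i.e. an
       empty cont or the last page "n | n" (text before and after the bar
       is equal). An empty cont also satisfies the second test, so the
       first test is there for readability only. -->
  <xsl:key name="lastPageKey"
           match="case[@cont = '' or
                       normalize-space(substring-before(@cont, '|')) =
                       normalize-space(substring-after(@cont, '|'))]"
           use="@id"/>

  <xsl:template match="/cases">
    <output>
      <!-- Muenchian grouping: keep the first indexed case for each id. -->
      <xsl:for-each select="case[generate-id() =
                                 generate-id(key('lastPageKey', @id)[1])]">
        <id val="{@id}">
          <duplicates>
            <!-- Every indexed case beyond the first is one duplicate. -->
            <xsl:value-of select="count(key('lastPageKey', @id)) - 1"/>
          </duplicates>
        </id>
      </xsl:for-each>
    </output>
  </xsl:template>
</xsl:stylesheet>
```

For the second sample input in the question, the key holds two cases for id 1 (both 2 | 2), one for id 2, two for id 3 and one for id 4, so the duplicate counts should come out as 1, 0, 1 and 0. If the data is not consistent (for example, a last page is missing), the count will be off, just as with the grouping versions above.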