
XSLT: Transform XML Using Group, Duplicate Check and Sort


I have the following problem. My input is an XML file like this:

        <?xml version="1.0" encoding="UTF-8"?>
        <EmpD>
          <PR>
            <RType>02</RType>
            <Emp>888</Emp>
          </PR>
          <PR>
            <RType>02</RType>
            <Emp>889</Emp>
          </PR>
          <JR>
            <RType>01</RType>
            <Emp>888</Emp>
            <Type>C</Type>
            <EDate>2020-05-01</EDate>
            <HR>1210148900</HR>
            <JobC>Test</JobC>
          </JR> 
          <JR>
            <RType>01</RType>
            <Emp>888</Emp>
            <Type>NC</Type>
            <EDate>2020-05-01</EDate>
            <HR>1210148900</HR>
            <JobC>Test</JobC>
          </JR> 
          <JR>
            <RType>01</RType>
            <Emp>888</Emp>
            <Type>C</Type>
            <EDate>2020-05-02</EDate>
            <HR>1210148900</HR>
            <JobC>Test</JobC>
          </JR>
          <JR>
            <RType>01</RType>
            <Emp>889</Emp>
            <Type>C</Type>
            <EDate>2020-05-01</EDate>
            <HR>1210148900</HR>
            <JobC>Test</JobC>
          </JR> 
          <JR>
            <RType>01</RType>
            <Emp>889</Emp>
            <Type>NC</Type>
            <EDate>2020-05-01</EDate>
            <HR>1210148900</HR>
            <JobC>Test</JobC>
          </JR> 
          <JR>
            <RType>01</RType>
            <Emp>889</Emp>
            <Type>NC</Type>
            <EDate>2020-05-02</EDate>
            <HR>1210148900</HR>
            <JobC>Test</JobC>
          </JR>  
        </EmpD>        

  1. The JR nodes can contain duplicates based on Emp and EDate. Is it possible to detect duplicates by the combination of Emp and EDate and remove them?

  2. My final output XML should look like the following, that is, sorted by Emp (for both PR and JR) and then by EDate.

     <?xml version="1.0" encoding="UTF-8"?>
     <EmpD>
       <PR>
         <RType>02</RType>
         <Emp>888</Emp>
       </PR>
       <JR>
         <RType>01</RType>
         <Emp>888</Emp>
         <EDate>2020-05-01</EDate>
         <HR>1210148900</HR>
         <JobC>Test</JobC>
       </JR>
       <JR>
         <RType>01</RType>
         <Emp>888</Emp>
         <EDate>2020-05-02</EDate>
         <HR>1210148900</HR>
         <JobC>Test</JobC>
       </JR>
       <JR>
         <RType>01</RType>
         <Emp>889</Emp>
         <EDate>2020-05-01</EDate>
         <HR>1210148900</HR>
         <JobC>Test</JobC>
       </JR> 
       <PR>
         <RType>02</RType>
         <Emp>889</Emp>
       </PR>
       <JR>
         <RType>01</RType>
         <Emp>889</Emp>
         <EDate>2020-05-01</EDate>
         <HR>1210148900</HR>
         <JobC>Test</JobC>
       </JR>
       <JR>
         <RType>01</RType>
         <Emp>889</Emp>
         <EDate>2020-05-02</EDate>
         <HR>1210148900</HR>
         <JobC>Test</JobC>
       </JR>  
     </EmpD>
    
  3. The Type field also matters: only JR records whose Type value is "C" should be considered.

  4. Finally, I need to produce CSV content. Can it be generated directly from this transformation, or do I need to generate the XML first and then convert it to CSV?


Solution

  • You can use a composite key in XSLT 3 to eliminate duplicates, either with for-each-group or with the key function:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="#all"
        version="3.0">
    
      <xsl:strip-space elements="*"/>
      <xsl:output indent="yes"/>
        
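      <!-- Composite key: JR elements indexed by the pair (Emp, EDate) -->
      <xsl:key name="dups" match="JR" composite="yes" use="Emp, EDate"/>
    
      <!-- Identity transformation: copy everything not matched by a template below -->
      <xsl:mode on-no-match="shallow-copy"/>
      
      <!-- Sort the children by Emp; the sort is stable, so ties keep document order -->
      <xsl:template match="EmpD">
          <xsl:copy>
              <xsl:apply-templates select="*">
                  <xsl:sort select="Emp"/>
              </xsl:apply-templates>
          </xsl:copy>
      </xsl:template>
      
      <!-- Drop every JR that is not the first in document order for its (Emp, EDate) pair -->
      <xsl:template match="JR[not(. is key('dups', (Emp, EDate))[1])]"/>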
      
    </xsl:stylesheet>
    

    https://xsltfiddle.liberty-development.net/jxDjimB
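
  • The for-each-group route is not shown above. Here is a minimal sketch, assuming that only JR records with Type "C" are wanted, that the first record per Emp and EDate wins, and that one comma-separated line per PR or JR record is an acceptable CSV layout. The row layout and the mode name row are assumptions, not part of the original answer. The sketch nests xsl:for-each-group on Emp and then on EDate, which has the same effect as a composite (Emp, EDate) key. It writes the text directly with method="text", so no intermediate XML document is needed:

    ```xml
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="3.0">

      <!-- Serialize as plain text so the result is CSV rather than XML -->
      <xsl:output method="text"/>

      <xsl:template match="EmpD">
          <!-- Outer grouping: one group per employee, in ascending Emp order.
               JR records whose Type is not "C" are excluded before grouping. -->
          <xsl:for-each-group select="PR | JR[Type = 'C']" group-by="Emp">
              <xsl:sort select="current-grouping-key()"/>
              <!-- PR line(s) of this employee first -->
              <xsl:apply-templates select="current-group()[self::PR]" mode="row"/>
              <!-- Inner grouping: one JR per EDate; "." is the first JR of each group -->
              <xsl:for-each-group select="current-group()[self::JR]" group-by="EDate">
                  <xsl:sort select="current-grouping-key()"/>
                  <xsl:apply-templates select="." mode="row"/>
              </xsl:for-each-group>
          </xsl:for-each-group>
      </xsl:template>

      <!-- Assumed row layout: the fields that exist, comma-separated, one record per line -->
      <xsl:template match="PR | JR" mode="row">
          <xsl:value-of select="RType, Emp, EDate, HR, JobC" separator=","/>
          <xsl:text>&#10;</xsl:text>
      </xsl:template>

    </xsl:stylesheet>
    ```

    For the sample input this drops the 889 record dated 2020-05-02, because its only entry has Type "NC". No header row is written and PR lines carry only RType and Emp, so adjust the row template if a fixed column set is required.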