
XSLT 1.0: grouping and removing duplicates


I have an XML grouping challenge in which I need to group AND remove duplicates. The input is as below:

<Person>
<name>John</name>
<date>June12</date>
<workTime taskID=1>34</workTime>
<workTime taskID=1>35</workTime>
<workTime taskID=2>12</workTime>
</Person>
<Person>
<name>John</name>
<date>June13</date>
<workTime taskID=1>21</workTime>
<workTime taskID=2>11</workTime>
<workTime taskID=2>14</workTime>
</Person>

Note that for each distinct combination of name/taskID/date, only the first workTime is picked up. In this example,

<workTime taskID=1>35</workTime> 
<workTime taskID=2>14</workTime> 

would be left aside.

Below is the expected output:

<Person>
<name>John</name>
<taskID>1</taskID>
<workTime>
<date>June12</date>
<time>34</time>
</workTime>
<workTime>
<date>June13</date>
<time>21</time>
</workTime>
</Person>
<Person>
<name>John</name>
<taskID>2</taskID>
<workTime>
<date>June12</date>
<time>12</time>
</workTime>
<workTime>
<date>June13</date>
<time>11</time>
</workTime>
</Person>

I have tried to use Muenchian grouping in XSLT 1.0 with the key below:

<xsl:key name="PersonTasks" match="workTime" use="concat(@taskID, ../name)"/>

but then how do I pick up only the first occurrence of

concat(@taskID, ../name, ../date)

? It seems that I need two levels of keys!?


Solution

  • This transformation:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>
    
     <xsl:key name="kwrkTimeByNameTask" match="workTime"
      use="concat(../name, '+', @taskID)"/>
    
     <xsl:key name="kDateByName" match="date"
      use="../name"/>
    
     <xsl:key name="kwrkTimeByNameTaskDate" match="workTime"
      use="concat(../name, '+', @taskID, '+', ../date)"/>
    
     <xsl:template match="/">
       <xsl:for-each select=
        "*/*/workTime
               [generate-id()
               =
                generate-id(key('kwrkTimeByNameTask',
                                 concat(../name, '+', @taskID)
                                )[1]
                            )
               ]
        ">
          <xsl:sort select="../name"/>
          <xsl:sort select="@taskID" data-type="number"/>
    
          <xsl:variable name="vcurTaskId" select="@taskID"/>
          <Person>
            <name><xsl:value-of select="../name"/></name>
            <taskID><xsl:value-of select="@taskID"/></taskID>
    
              <xsl:for-each select=
               "key('kDateByName', ../name)
                      [key('kwrkTimeByNameTaskDate',
                           concat(../name, '+', current()/@taskID, '+', .)
                          )
                      ]
               ">
                 <workTime>
                   <date><xsl:value-of select="."/></date>
                   <time>
                    <xsl:value-of select=
                     "key('kwrkTimeByNameTaskDate',
                      concat(../name, '+', $vcurTaskId, '+', .)
                     )"/>
                   </time>
                 </workTime>
              </xsl:for-each>
          </Person>
       </xsl:for-each>
     </xsl:template>
    </xsl:stylesheet>
    

    when applied to the provided XML (corrected in several places to make it well-formed):

    <t>
        <Person>
            <name>John</name>
            <date>June12</date>
            <workTime taskID="1">34</workTime>
            <workTime taskID="1">35</workTime>
            <workTime taskID="2">12</workTime>
        </Person>
        <Person>
            <name>John</name>
            <date>June13</date>
            <workTime taskID="1">21</workTime>
            <workTime taskID="2">11</workTime>
            <workTime taskID="2">14</workTime>
        </Person>
    </t>
    

    produces the wanted, correct result:

    <Person>
       <name>John</name>
       <taskID>1</taskID>
       <workTime>
          <date>June12</date>
          <time>34</time>
       </workTime>
       <workTime>
          <date>June13</date>
          <time>21</time>
       </workTime>
    </Person>
    <Person>
       <name>John</name>
       <taskID>2</taskID>
       <workTime>
          <date>June12</date>
          <time>12</time>
       </workTime>
       <workTime>
          <date>June13</date>
          <time>11</time>
       </workTime>
    </Person>
    

    Explanation:

    1. First we obtain all workTime elements with unique pairs of ../name, @taskID by using the Muenchian method for grouping.

    2. We sort these by ../name and @taskID -- in that order.

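    3. For each such workTime we get all date elements that are listed with the ../name of this workTime, and keep only those date elements for which there is a workTime with the same ../name, the same ../date and the @taskID of the current workTime.

    4. In the previous step we use two different auxiliary keys: 'kDateByName' indexes all date elements by their ../name, while 'kwrkTimeByNameTaskDate' indexes all workTime elements by their ../name, their @taskID and their ../date.

    5. The duplicates are dropped in the body of the inner <xsl:for-each>: in XSLT 1.0, <xsl:value-of> applied to a node-set outputs the string value of only the first node in document order, so of the two workTime elements for John, June12 and taskID 1, only 34 is output.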

    So, the meaning of the following:

              <xsl:for-each select=
               "key('kDateByName', ../name)
                      [key('kwrkTimeByNameTaskDate',
                           concat(../name, '+', current()/@taskID, '+', .)
                          )
                      ]
               ">
    

    is:

    For each date for that name, such that a workTime for that name, date and @taskID (of the current workTime for the outer <xsl:for-each>) exists, do whatever is in the body of this <xsl:for-each> instruction.
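
    The "two levels of keys" idea from the question can also be followed directly. The sketch below is an alternative to the transformation above, not part of it: it applies the Muenchian test twice, first on name + taskID to find the groups, then on name + taskID + date to keep only the first workTime per date within each group. The key names kByNameTask and kByNameTaskDate are made up for this illustration, and the sketch assumes the same well-formed input as above (Person elements under a single top element, one name and one date per Person).

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>

     <!-- Level 1 (hypothetical name): workTime indexed by name + taskID -->
     <xsl:key name="kByNameTask" match="workTime"
      use="concat(../name, '+', @taskID)"/>

     <!-- Level 2 (hypothetical name): workTime indexed by name + taskID + date -->
     <xsl:key name="kByNameTaskDate" match="workTime"
      use="concat(../name, '+', @taskID, '+', ../date)"/>

     <xsl:template match="/">
       <!-- Outer loop: the first workTime of every name + taskID group -->
       <xsl:for-each select=
        "*/*/workTime
               [generate-id()
               =
                generate-id(key('kByNameTask',
                                 concat(../name, '+', @taskID)
                                )[1]
                            )
               ]
        ">
          <xsl:sort select="../name"/>
          <xsl:sort select="@taskID" data-type="number"/>

          <Person>
            <name><xsl:value-of select="../name"/></name>
            <taskID><xsl:value-of select="@taskID"/></taskID>

            <!-- Inner loop: all workTime of this group, keeping only the
                 first one for every name + taskID + date combination -->
            <xsl:for-each select=
             "key('kByNameTask', concat(../name, '+', @taskID))
                    [generate-id()
                    =
                     generate-id(key('kByNameTaskDate',
                                      concat(../name, '+', @taskID, '+', ../date)
                                     )[1]
                                 )
                    ]
             ">
               <workTime>
                 <date><xsl:value-of select="../date"/></date>
                 <time><xsl:value-of select="."/></time>
               </workTime>
            </xsl:for-each>
          </Person>
       </xsl:for-each>
     </xsl:template>
    </xsl:stylesheet>

    For the sample input this should select 34 and 21 for taskID 1, and 12 and 11 for taskID 2, which are the same values as in the result shown earlier. The design difference is that the inner loop walks the workTime elements of the group itself rather than the date elements of the name, so the first occurrence is chosen explicitly by the [1] in the predicate instead of implicitly by <xsl:value-of>.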