
XSLT Muenchian Grouping on different elements based on a common attribute


I am given XML similar to the following that I need to process.

<root>
    <Header/>
    <Customer id="1" date="13/04/2014"/>
    <Account id="1" date="14/04/2014"/>
    <Account id="1" date="01/06/2015"/>
    <Address id="1" date="14/04/2014"/>
    <Customer id="2" date="12/08/2015"/>
    <Account id="2" date="13/08/2015"/>
    <Address id="2" date="13/08/2015"/>
    <Address id="2" date="03/09/2015"/>
    <Address id="2" date="27/01/2017"/>
    <Customer id="3" date="04/10/2015"/>
    <Customer id="3" date="01/02/2017"/>
    <Account id="3" date="05/10/2015"/>
    <Address id="3" date="08/10/2015"/>
    <Address id="3" date="03/09/2016"/>
</root>

All of the nodes have more attributes, but I stripped them out. Each element has an id and a date. If the same element appears more than once with the same id, the one with the most recent date is considered valid and the older ones should be ignored.

If the older ones can be stripped out in the same pass, I would like the output to look something like this:

<Customers>
    <Customer id="1">
        <Account/>
        <Address/>
    </Customer>
    <Customer id="2">
        <Account/>
        <Address/>
    </Customer>
    <Customer id="3">
        <Account/>
        <Address/>
    </Customer>
</Customers>

If not, it is fine to process the file in two transforms: one to group the elements by customer id, leaving each customer with multiple Account/Address entries, and a second to remove the older entries.

The source XML has close to a million entries, so performance is an issue. A transform that takes a few minutes is fine, but anything over 15 minutes will not work.

I currently have the following XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

    <xsl:key name="nodes-by-id" match="//root/*" use="@id"/>

    <xsl:template match="root">
        <Customers>
            <xsl:for-each select="*[count(. | key('nodes-by-id', @id)[1]) = 1]">
                <xsl:variable name="current-grouping-key" select="@id"/>
                <xsl:variable name="current-group" select="key('nodes-by-id', $current-grouping-key)"/>
                <Customer>
                    <xsl:attribute name="id">
                        <xsl:value-of select="$current-grouping-key"/>
                    </xsl:attribute>
                    <CustomerElements>
                        <xsl:for-each select="$current-group/Customer">
                            <CustomerElement>
                                <xsl:attribute name="date">
                                    <xsl:value-of select="@date"/>
                                </xsl:attribute>
                            </CustomerElement>
                        </xsl:for-each>
                    </CustomerElements>
                    <xsl:apply-templates select="$current-group"/>
                </Customer>
            </xsl:for-each>
        </Customers>
    </xsl:template>
</xsl:stylesheet>

Currently this just tries to group all of the elements by their id and then output all of the Customer elements. I get the following:

<Customers>
    <Customer id="">
        <CustomerElements/>
    </Customer>
    <Customer id="1">
        <CustomerElements/>
    </Customer>
    <Customer id="2">
        <CustomerElements/>
    </Customer>
    <Customer id="3">
        <CustomerElements/>
    </Customer>
</Customers>

I get the customer with the blank id because I don't ignore the header row. My real question is: why does the $current-group variable not contain any elements?

Also, any tips on how to ignore the header row and how to filter out the entries with older dates would be appreciated.


Solution

  • I got everything sorted. This is a segment of the XSLT I used; more information is in the XML comments.

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
    
        <xsl:key name="nodes-by-id" match="//root/*" use="@id"/>
    
        <xsl:template match="root">
            <CustomerMeters>
            <!-- Using select="Customer[cou... instead of select="*[cou... will cause it to ignore the header. However, it requires
                the Customer element to be the first element for the icp in the XML. -->
                <xsl:for-each select="Customer[count(. | key('nodes-by-id', @id)[1]) = 1]">
                    <xsl:variable name="current-grouping-key" select="@id"/>
                    <xsl:variable name="current-group" select="key('nodes-by-id', $current-grouping-key)"/>
    
                    <xsl:variable name="current-group-sorted">
                        <!-- If we sort all nodes in descending date order, we can fetch the first Address/Customer/etc. from this group and it will be the latest. -->
                        <xsl:for-each select="$current-group">
                            <!-- year -->
                            <xsl:sort select="substring(@date, 7, 4)" order="descending" data-type="number"/>
                            <!-- month -->
                            <xsl:sort select="substring(@date, 4, 2)" order="descending" data-type="number"/>
                            <!-- day -->
                            <xsl:sort select="substring(@date, 1, 2)" order="descending" data-type="number"/>
                            <xsl:copy-of select="current()" />
                        </xsl:for-each>
                    </xsl:variable>
                    <Customer>
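                        <!-- In here I can get what I want from the current-group-sorted variable. -->
                        <!-- Because the nodes are in descending date order, the first occurrence of each element is the most recent. -->
                        <!-- Note: in strict XSLT 1.0 this variable is a result tree fragment, so some processors need an extension function such as exsl:node-set() before a path can be applied to it. -->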
                        <someField>
                            <xsl:value-of select="$current-group-sorted/*[self::Account][1]/@someAttribute"/>
                        </someField>
                    </Customer>
                </xsl:for-each>
            </CustomerMeters>
        </xsl:template>
    </xsl:stylesheet>
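
On the original question of why $current-group looked empty: the variable did hold the grouped nodes, but $current-group/Customer asks for Customer children of those nodes, and the grouped elements have no children. $current-group[self::Customer] filters the group itself instead. What follows is a minimal sketch, not the stylesheet used above, that applies this to the sample input from the question. It assumes every date is dd/mm/yyyy, skips the header by keying only on elements that have an id attribute, and sorts inside a named template rather than building an intermediate sorted variable. The template name latest and its candidates parameter are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

    <!-- Index only elements that carry an id, so <Header/> never enters a group. -->
    <xsl:key name="nodes-by-id" match="root/*[@id]" use="@id"/>

    <xsl:template match="/root">
        <Customers>
            <!-- Muenchian step: keep the first element seen for each distinct id. -->
            <xsl:for-each select="*[@id][count(. | key('nodes-by-id', @id)[1]) = 1]">
                <xsl:variable name="current-group" select="key('nodes-by-id', @id)"/>
                <Customer id="{@id}">
                    <!-- Filter the group with self::, not with a child step. -->
                    <xsl:call-template name="latest">
                        <xsl:with-param name="candidates" select="$current-group[self::Account]"/>
                    </xsl:call-template>
                    <xsl:call-template name="latest">
                        <xsl:with-param name="candidates" select="$current-group[self::Address]"/>
                    </xsl:call-template>
                </Customer>
            </xsl:for-each>
        </Customers>
    </xsl:template>

    <!-- Hypothetical helper: copies the candidate with the most recent dd/mm/yyyy date. -->
    <xsl:template name="latest">
        <xsl:param name="candidates"/>
        <xsl:for-each select="$candidates">
            <!-- Rebuild the date as yyyymmdd so one numeric sort orders it. -->
            <xsl:sort select="concat(substring(@date, 7, 4), substring(@date, 4, 2), substring(@date, 1, 2))"
                      data-type="number" order="descending"/>
            <!-- position() counts in sorted order, so 1 is the newest entry. -->
            <xsl:if test="position() = 1">
                <xsl:copy-of select="."/>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Against the sample input this should give one Customer per id, each holding the Account and Address with the latest date (for id 2, the Address dated 27/01/2017). The copied elements keep their id and date attributes; replace xsl:copy-of with literal elements if only selected attributes are wanted. Because nothing here navigates into a result tree fragment, no node-set() extension is needed.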