
XSLT Performance Help for High Volume Data Set


I'm running an XSLT transformation on a very large XML input (several million lines) and am trying to make it more efficient.

My input data looks something like this:

<root>
    <entry>
        <groupID>123</groupID>
        <primary>true</primary>
        <date>2023-01-31</date>
        <lineID>
           <ID type='index'>ABC</ID>
           <ID type='objID'>0110</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>123</groupID>
        <primary>true</primary>
        <date>2023-01-31</date>
        <lineID>
           <ID type='index'>ABC</ID>
           <ID type='objID'>0110</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>123</groupID>
        <primary>false</primary>
        <date>2023-01-31</date>
        <lineID>
           <ID type='index'>XYZ</ID>
           <ID type='objID'>0221</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>789</groupID>
        <primary>false</primary>
        <date>2023-01-01</date>
        <lineID>
           <ID type='index'>087</ID>
           <ID type='objID'>0330</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>789</groupID>
        <primary>false</primary>
        <date>2023-01-01</date>
        <lineID>
           <ID type='index'>087</ID>
           <ID type='objID'>0330</ID>
        </lineID>
    </entry>
</root>

I want to replace each entry's lineID with the lineID of the primary entry in the same group, and copy the rest of the XML over unchanged. So the output would look like this:

<root>
    <entry>
        <groupID>123</groupID>
        <primary>true</primary>
        <date>2023-01-31</date>
        <lineID>
           <ID type='index'>ABC</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>123</groupID>
        <primary>true</primary>
        <date>2023-01-31</date>
        <lineID>
           <ID type='index'>ABC</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>123</groupID>
        <primary>false</primary>
        <date>2023-01-31</date>
        <lineID>
           <ID type='index'>ABC</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>789</groupID>
        <primary>false</primary>
        <date>2023-01-01</date>
        <lineID>
           <ID type='index'>087</ID>
        </lineID>
    </entry>
    <entry>
        <groupID>789</groupID>
        <primary>false</primary>
        <date>2023-01-01</date>
        <lineID>
           <ID type='index'>087</ID>
        </lineID>
    </entry>
</root>

This is my XSLT. It works, but it's a bit slow, and I can't figure out how to write a more efficient one. Any pointers, suggestions, or edits are appreciated! I know the exists() function is not particularly efficient, and I've considered moving to XSLT 3.0, but I couldn't get streamable mode or the input doc to work correctly.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs"
    version="2.0">
    
    <xsl:output indent="yes"/>
    
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match='lineID'>
        <xsl:variable name='ID' select='.'/>
        <xsl:variable name='groupID' select='../groupID'/>
        
        <lineID>            
            <xsl:choose>
                <xsl:when test='exists(../../entry[groupID = $groupID and primary = true() and lineID != $ID])'>
                    <xsl:value-of select='../../entry[groupID = $groupID and primary = true() and lineID != $ID][1]/lineID'/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select='$ID'/>
                </xsl:otherwise>
            </xsl:choose>
        </lineID>
    </xsl:template>
</xsl:stylesheet>

Solution

  • I like the idea of using streaming with group-adjacent. Because the values you need from each entry element (groupID, primary, lineID) are in child elements, you will need to use copy-of() so that each entry is available as a complete in-memory copy:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="#all"
      expand-text="yes">
    
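      <!-- Streamable default mode: copies anything not matched below. -->
      <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
    
      <xsl:template match="root">
        <xsl:copy>
          <!-- copy-of() snapshots each entry so its children can be read freely;
               group-adjacent assumes entries of one group are contiguous. -->
          <xsl:for-each-group select="entry!copy-of()" group-adjacent="groupID">
            <xsl:apply-templates select="current-group()" mode="grounded">
              <!-- lineID of the first primary entry in the group (empty if there is none). -->
              <xsl:with-param name="lineID" tunnel="yes" select="current-group()[primary = 'true'][1]/lineID"/>
            </xsl:apply-templates>
          </xsl:for-each-group>
        </xsl:copy>
      </xsl:template>
      
      <!-- Non-streamed mode for the in-memory entry copies. -->
      <xsl:mode name="grounded" on-no-match="shallow-copy"/>
      
      <xsl:template match="lineID" mode="grounded">
        <xsl:param name="lineID" tunnel="yes"/>
        <!-- Use the group's primary lineID if present, else this entry's own. -->
        <xsl:copy>{($lineID, .)[1]}</xsl:copy>
      </xsl:template>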
      
      <xsl:output indent="yes"/>
    
    </xsl:stylesheet>
    

    In my view an accumulator alone will not help, if I understand your data correctly, because it is not clear at which position within a group the "primary" lineID occurs, or whether it occurs at all.
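
    If streaming is not a requirement, a key may already remove most of the cost, because the predicate in the question rescans every entry for each lineID. The following is a minimal, untested sketch for XSLT 2.0. The key name primary-by-group is made up for illustration, and the sketch assumes that a group consists of all entries with the same groupID anywhere in the document, not only adjacent ones. Like the stylesheet above, it takes the first primary lineID of the group, falls back to the entry's own lineID, and writes the string value of that lineID.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

      <xsl:output indent="yes"/>

      <!-- Hypothetical key: indexes the primary entries by their groupID,
           so each lookup avoids scanning all entries again. -->
      <xsl:key name="primary-by-group" match="entry[primary = 'true']" use="groupID"/>

      <!-- Identity template: copies everything not handled below. -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <xsl:template match="lineID">
        <!-- lineID of the first primary entry (in document order) with the same groupID, if any. -->
        <xsl:variable name="primary" select="key('primary-by-group', ../groupID)[1]/lineID"/>
        <xsl:copy>
          <!-- Fall back to this entry's own lineID when the group has no primary entry. -->
          <xsl:value-of select="($primary, .)[1]"/>
        </xsl:copy>
      </xsl:template>

    </xsl:stylesheet>
    ```

    Unlike the streaming version, this builds the whole input tree in memory, so it only fits if the document can be held in memory.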