I'm running an XSLT to transform a very high volume XML input (multiple millions of lines) and trying to make the transformation more efficient.
My input data looks something like this:
<root>
<entry>
<groupID>123</groupID>
<primary>true</primary>
<date>2023-01-31</date>
<lineID>
<ID type='index'>ABC</ID>
<ID type='objID'>0110</ID>
</lineID>
</entry>
<entry>
<groupID>123</groupID>
<primary>true</primary>
<date>2023-01-31</date>
<lineID>
<ID type='index'>ABC</ID>
<ID type='objID'>0110</ID>
</lineID>
</entry>
<entry>
<groupID>123</groupID>
<primary>false</primary>
<date>2023-01-31</date>
<lineID>
<ID type='index'>XYZ</ID>
<ID type='objID'>0221</ID>
</lineID>
</entry>
<entry>
<groupID>789</groupID>
<primary>false</primary>
<date>2023-01-01</date>
<lineID>
<ID type='index'>087</ID>
<ID type='objID'>0330</ID>
</lineID>
</entry>
<entry>
<groupID>789</groupID>
<primary>false</primary>
<date>2023-01-01</date>
<lineID>
<ID type='index'>087</ID>
<ID type='objID'>0330</ID>
</lineID>
</entry>
</root>
I want to update each lineID to match the primary lineID within the same group and copy over the rest of the XML unchanged. So the output would look like this:
<root>
<entry>
<groupID>123</groupID>
<primary>true</primary>
<date>2023-01-31</date>
<lineID>
<ID type='index'>ABC</ID>
</lineID>
</entry>
<entry>
<groupID>123</groupID>
<primary>true</primary>
<date>2023-01-31</date>
<lineID>
<ID type='index'>ABC</ID>
</lineID>
</entry>
<entry>
<groupID>123</groupID>
<primary>false</primary>
<date>2023-01-31</date>
<lineID>
<ID type='index'>ABC</ID>
</lineID>
</entry>
<entry>
<groupID>789</groupID>
<primary>false</primary>
<date>2023-01-01</date>
<lineID>
<ID type='index'>087</ID>
</lineID>
</entry>
<entry>
<groupID>789</groupID>
<primary>false</primary>
<date>2023-01-01</date>
<lineID>
<ID type='index'>087</ID>
</lineID>
</entry>
</root>
This is my XSLT. It works, but it's a bit slow, and I just can't figure out how to write a more efficient one. Any pointers, suggestions, or edits appreciated! I know the exists() function is not particularly efficient, and I've considered moving to XSLT 3.0, but I couldn't get the streamable mode or the input document to work correctly.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs"
version="2.0">
<xsl:output indent="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match='lineID'>
<xsl:variable name='ID' select='.'/>
<xsl:variable name='groupID' select='../groupID'/>
<lineID>
<xsl:choose>
<xsl:when test='exists(../../entry[groupID = $groupID and primary = true() and lineID != $ID])'>
<xsl:value-of select='../../entry[groupID = $groupID and primary = true() and lineID != $ID][1]/lineID'/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select='$ID'/>
</xsl:otherwise>
</xsl:choose>
</lineID>
</xsl:template>
</xsl:stylesheet>
I like the idea of using streaming with group-adjacent (which assumes that entries with the same groupID are adjacent, as in your sample). Given that the values of the entry elements are in child elements, you will need to use copy-of():
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="#all"
expand-text="yes">
<xsl:mode on-no-match="shallow-copy" streamable="yes"/>
<xsl:template match="root">
<xsl:copy>
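<!-- copy-of() turns each streamed entry into an in-memory copy, so its children can be read more than once -->
<xsl:for-each-group select="entry!copy-of()" group-adjacent="groupID">
<xsl:apply-templates select="current-group()" mode="grounded">
<!-- lineID of the first primary entry in this group; empty if the group has no primary entry -->
<xsl:with-param name="lineID" tunnel="yes" select="current-group()[primary = 'true'][1]/lineID"/>
</xsl:apply-templates>
</xsl:for-each-group>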
</xsl:copy>
</xsl:template>
<xsl:mode name="grounded" on-no-match="shallow-copy"/>
<xsl:template match="lineID" mode="grounded">
<xsl:param name="lineID" tunnel="yes"/>
<!-- Use the group's primary lineID if one was passed in, otherwise keep the current value -->
<xsl:copy>{($lineID, .)[1]}</xsl:copy>
</xsl:template>
<xsl:output indent="yes"/>
</xsl:stylesheet>
In my view an accumulator alone will not help, if I understand your data correctly, because it is not clear at which position in a group (if at all) the "primary" lineID occurs.
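If streaming turns out not to be an option, a key-based lookup may still avoid re-scanning all entry elements for every lineID, which is what the original predicate does. The following is only a minimal non-streaming sketch under some assumptions: the key name primary-by-group is made up for illustration, the whole document has to fit in memory, and it takes the first primary entry of a group rather than the first one whose lineID differs from the current one (the result is the same whenever a group has at most one distinct primary lineID). Like the stylesheets above, it writes the string value of the chosen lineID.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output indent="yes"/>

  <!-- Hypothetical key name: indexes only the primary entries, by their groupID -->
  <xsl:key name="primary-by-group" match="entry[primary = 'true']" use="groupID"/>

  <!-- Identity template: copies everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="lineID">
    <!-- lineID of the first primary entry in the same group; empty if there is none -->
    <xsl:variable name="primary-line"
                  select="key('primary-by-group', ../groupID)[1]/lineID"/>
    <xsl:copy>
      <!-- Fall back to the current lineID when the group has no primary entry -->
      <xsl:value-of select="($primary-line, .)[1]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Unlike group-adjacent, the key does not depend on entries of the same group being next to each other, but it gives up the memory savings of streaming, so it is worth measuring against your real input before committing to it.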