Tags: streaming, saxon, xslt-3.0

Make sibling nodes have a parent in a streamable mode


I have a document with many sibling <Line> elements, as follows:

<Report>
    <Date>2020-07-25</Date>
    <Number>12</Number>
    <Line>
        <LineNumber>1</LineNumber>
        <Description>Some text</Description>
        <Quantity>5</Quantity>
    </Line>
    <Line>
        <LineNumber>2</LineNumber>
        <Description>Some other text</Description>
        <Quantity>9</Quantity>
    </Line>
</Report>

I want to produce output in which those elements are combined under a single parent, like this:

<INV>
    <HEAD>
        <DTM>2020-07-25</DTM>
        <ID>12</ID>
    </HEAD>
    <LINES>
        <LINE>
            <NUM>1</NUM>
            <DESC>Some text</DESC>
            <QTY>5</QTY>
        </LINE>
        <LINE>
            <NUM>2</NUM>
            <DESC>Some other text</DESC>
            <QTY>9</QTY>
        </LINE>
    </LINES>
</INV>

One possible solution is to group the elements by name:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="3.0">
    
    <xsl:mode streamable="yes" on-no-match="deep-skip"/>
    <xsl:mode name="non-streamable" on-no-match="shallow-skip"/>
    
    <xsl:template match="/Report">
        <xsl:element name="INV">
            <xsl:fork>
                <xsl:for-each-group select="*" group-by="name() = 'Line'">
                    <xsl:choose>
                        <xsl:when test="current-grouping-key()">
                            <xsl:element name="LINES">
                                <xsl:apply-templates select="current-group()/copy-of()" mode="non-streamable"/>                     
                            </xsl:element>
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:element name="HEAD">
                                <xsl:apply-templates select="current-group()/copy-of()" mode="non-streamable"/>                     
                            </xsl:element>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:for-each-group>
            </xsl:fork>
        </xsl:element>
    </xsl:template>
    
    <xsl:template match="Date" mode="non-streamable">
        <DTM>
            <xsl:value-of select="."/>
        </DTM>
    </xsl:template>
    
    <xsl:template match="Number" mode="non-streamable">
        <ID>
            <xsl:value-of select="."/>
        </ID>
    </xsl:template>
    
    <xsl:template match="Line" mode="non-streamable">
        <LINE>
            <NUM>
                <xsl:value-of select="LineNumber"/>
            </NUM>
            <DESC>
                <xsl:value-of select="Description"/>
            </DESC>
            <QTY>
                <xsl:value-of select="Quantity"/>
            </QTY>
        </LINE>
    </xsl:template>

</xsl:stylesheet>

But with this approach I ran into high memory consumption: it took about 2.5 GB of RAM to transform a real-life 500 MB document containing about 1 million lines. Are the grouped elements stored in memory? Can that be avoided?
Is there another way to perform this task?


Solution

  • The xsl:for-each-group instruction with @group-by is streamable as far as the technical definition is concerned, because it can operate without having the full source document in memory; however, it constructs the selected groups in memory, so it can still have a high memory requirement. So this isn't the right approach.

    I think you're on the right lines here, but you can use group-adjacent rather than group-by, which makes it fully streamable. Here's my solution:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="3.0">
      
      <xsl:mode streamable="yes" on-no-match="deep-skip"/>
      <xsl:mode name="non-streamable" on-no-match="shallow-skip"/>
      
      <xsl:template match="/Report">
        <xsl:element name="INV">
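            <!-- group-adjacent handles each run of consecutive siblings as it is read,
                 instead of collecting every group in memory first as group-by does -->
            <xsl:for-each-group select="*" group-adjacent="name() = 'Line'">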
              <xsl:choose>
                <xsl:when test="current-grouping-key()">
                  <xsl:element name="LINES">
                    <xsl:apply-templates select="current-group()"/>                     
                  </xsl:element>
                </xsl:when>
                <xsl:otherwise>
                  <xsl:element name="HEAD">
                    <!-- The header group is only a few small elements, so copy-of() costs little memory here -->
                    <xsl:apply-templates select="current-group()/copy-of()" mode="non-streamable"/>
                  </xsl:element>
                </xsl:otherwise>
              </xsl:choose>
            </xsl:for-each-group>
        </xsl:element>
      </xsl:template>
      
      <xsl:template match="Date" mode="non-streamable">
        <DTM>
          <xsl:value-of select="."/>
        </DTM>
      </xsl:template>
      
      <xsl:template match="Number" mode="non-streamable">
        <ID>
          <xsl:value-of select="."/>
        </ID>
      </xsl:template>
      
      <xsl:template match="Line">
        <LINE>
          <xsl:apply-templates/>
        </LINE>
      </xsl:template>
      
      <xsl:template match="LineNumber">
          <NUM>
            <xsl:value-of select="."/>
          </NUM>
      </xsl:template>
      
      <xsl:template match="Description">
        <DESC>
          <xsl:value-of select="."/>
        </DESC>
      </xsl:template>
      
      <xsl:template match="Quantity">
        <QTY>
          <xsl:value-of select="."/>
        </QTY>
      </xsl:template>
      
    </xsl:stylesheet>
    

    I haven't tried it on a large input file, but I think it should operate in constant memory.
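
    As a possible further simplification, the header elements could be handled in the same streamable mode as the lines, which drops copy-of() and the second mode altogether. The following is an untested sketch, not a verified stylesheet: it assumes a streaming-capable processor (streaming in Saxon needs Saxon-EE), and it picks the wrapper name with an attribute value template on xsl:element, a choice made here for brevity rather than anything the answer above prescribes.

    ```xml
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">

      <!-- One streamable mode; nodes without a matching template (such as whitespace text) are skipped -->
      <xsl:mode streamable="yes" on-no-match="deep-skip"/>

      <xsl:template match="/Report">
        <INV>
          <xsl:for-each-group select="*" group-adjacent="name() = 'Line'">
            <!-- The grouping key is a boolean: true for a run of Line elements, false otherwise -->
            <xsl:element name="{if (current-grouping-key()) then 'LINES' else 'HEAD'}">
              <xsl:apply-templates select="current-group()"/>
            </xsl:element>
          </xsl:for-each-group>
        </INV>
      </xsl:template>

      <!-- Header fields -->
      <xsl:template match="Date">
        <DTM><xsl:value-of select="."/></DTM>
      </xsl:template>

      <xsl:template match="Number">
        <ID><xsl:value-of select="."/></ID>
      </xsl:template>

      <!-- Line and its fields -->
      <xsl:template match="Line">
        <LINE><xsl:apply-templates/></LINE>
      </xsl:template>

      <xsl:template match="LineNumber">
        <NUM><xsl:value-of select="."/></NUM>
      </xsl:template>

      <xsl:template match="Description">
        <DESC><xsl:value-of select="."/></DESC>
      </xsl:template>

      <xsl:template match="Quantity">
        <QTY><xsl:value-of select="."/></QTY>
      </xsl:template>

    </xsl:stylesheet>
    ```

    Note that group-adjacent, unlike group-by, starts a new group every time the key changes. Both this sketch and the stylesheet above therefore rely on all <Line> elements being contiguous, as they are in the sample input; if other elements appeared between lines, the output would contain several HEAD and LINES wrappers.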