Search code examples
xmlxsltsql-server-openxml

Split giant XML file into n-child versions


For example the giant file has 50 million lines of such:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
  <activity>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
  </activity>
</root>

And each 'child' file would have the same structure, but be 5 million lines or so, or 1/10th of the original.

The reason for this is to make the import of such into a database more manageable, without blowing out the memory (SQL Server's OPENXML).

Is XSLT the best choice here?


Solution

  • The Enterprise Edition of Saxon 9.8 (Saxon 9.8 EE) supports the streaming feature of the one year old XSLT 3.0 specification which allows you to use a subset of XSLT to read through an XML documents in a forwards only way, using only the memory necessary to store the currently visited node and its ancestors.

    Using that approach you can write code like for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size" to do a positional grouping that reads through the file deliv by deliv element and collects them into groups of $size:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:math="http://www.w3.org/2005/xpath-functions/math"
        exclude-result-prefixes="xs math"
        version="3.0">
    
        <xsl:param name="size" as="xs:integer" select="1000"/>
    
        <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
    
        <xsl:template match="root">
            <xsl:for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size">
                <xsl:result-document href="split-{format-number(current-grouping-key() + 1, '00000')}.xml" indent="yes">
                    <root>
                        <activity>
                            <xsl:copy-of select="current-group()"/>
                        </activity>
                    </root>
                </xsl:result-document>
            </xsl:for-each-group>
        </xsl:template>
    
    </xsl:stylesheet>
    

    That splits up your input into a number of files, each file having $size deliv elements (respectively the last one the remaining deliv elements if there are less than $size left).

    Using Saxon EE requires obtaining a commercial license but trial licences exist.