For example the giant file has 50 million lines of such:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
<activity>
<deliv>
<subitem1>text</subitem1>
<subitem2>text</subitem2>
</deliv>
<deliv>
<subitem1>text</subitem1>
<subitem2>text</subitem2>
</deliv>
<deliv>
<subitem1>text</subitem1>
<subitem2>text</subitem2>
</deliv>
</activity>
</root>
And each 'child' file would have the same structure, but be 5 million lines or so, or 1/10th of the original.
The reason for this is to make the import of such into a database more manageable, without blowing out the memory (SQL Server's OPENXML).
Is XSLT the best choice here?
The Enterprise Edition of Saxon 9.8 (Saxon 9.8 EE) supports the streaming feature of the one year old XSLT 3.0 specification which allows you to use a subset of XSLT to read through an XML documents in a forwards only way, using only the memory necessary to store the currently visited node and its ancestors.
Using that approach you can write code like for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size"
to do a positional grouping that reads through the file deliv
by deliv
element and collects them into groups of $size
:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
exclude-result-prefixes="xs math"
version="3.0">
<xsl:param name="size" as="xs:integer" select="1000"/>
<xsl:mode on-no-match="shallow-copy" streamable="yes"/>
<xsl:template match="root">
<xsl:for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size">
<xsl:result-document href="split-{format-number(current-grouping-key() + 1, '00000')}.xml" indent="yes">
<root>
<activity>
<xsl:copy-of select="current-group()"/>
</activity>
</root>
</xsl:result-document>
</xsl:for-each-group>
</xsl:template>
</xsl:stylesheet>
That splits up your input into a number of files, each file having $size
deliv
elements (respectively the last one the remaining deliv
elements if there are less than $size
left).
Using Saxon EE requires obtaining a commercial license but trial licences exist.