I have a large XML corpus document which has a structure that looks generally like the following:
<corpus>
<document n="001">
<front>
<title>foo title</title>
<group n="foo_group_A"/>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
<seg n="3">some text with markups</seg>
</body>
</document>
<document n="002">
<front>
<title>foo title</title>
<group n="foo_group_A"/>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
</body>
</document>
<document n="003">
<front>
<title>foo title</title>
<group n="foo_group_A"/>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
<seg n="3">some text with markups</seg>
</body>
</document>
<document n="004">
<front>
<title>foo title</title>
<group n="foo_group_B"/>
</front>
<body>
<seg n="1">some text with markups</seg>
</body>
</document>
<document n="005">
<front>
<title>foo title</title>
<group n="foo_group_B"/>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
</body>
</document>
[...]
</corpus>
I am pre-processing this XML file into a different XML format using XSLT 3.0 before finally outputting to PDF. As part of the transformation, I want to collect and wrap the <document> elements in a new <chapter> element which reflects the value of front/group/@n. The new corpus would look like the following, where the group/@n value provides the logic for grouping under the new chapter:
<corpus>
<chapter n="foo_group_A">
<document n="001">
<front>
<title>foo title</title>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
<seg n="3">some text with markups</seg>
</body>
</document>
<document n="002">
<front>
<title>foo title</title>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
</body>
</document>
<document n="003">
<front>
<title>foo title</title>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
<seg n="3">some text with markups</seg>
</body>
</document>
</chapter>
<chapter n="foo_group_B">
<document n="004">
<front>
<title>foo title</title>
</front>
<body>
<seg n="1">some text with markups</seg>
</body>
</document>
<document n="005">
<front>
<title>foo title</title>
</front>
<body>
<seg n="1">some text with markups</seg>
<seg n="2">some text with markups</seg>
</body>
</document>
</chapter>
[...]
</corpus>
The file is already pre-sorted by group (foo_group_A, foo_group_B, etc.), so no extra sorting is necessary. It just requires creating a new <chapter> element to contain the relevant documents. I've tried this with xsl:for-each, but I think I'm missing some sort of 'summary' or 'collection' of groups through which to iterate.
Many thanks in advance.
If you use XSLT 3 and want to group items, use xsl:for-each-group rather than xsl:for-each, e.g.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="corpus">
<xsl:copy>
<xsl:for-each-group select="document" group-by="front/group/@n">
<chapter n="{current-grouping-key()}">
<xsl:apply-templates select="current-group()"/>
</chapter>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
<xsl:template match="front/group"/>
</xsl:stylesheet>
http://xsltfiddle.liberty-development.net/nbUY4ki
If the documents are already sorted by the grouping key front/group/@n, it should also suffice to use xsl:for-each-group select="document" group-adjacent="front/group/@n" instead of the group-by above. That way it would also be easier to use streaming for huge documents, by adding streamable="yes" to the xsl:mode declaration and using xsl:for-each-group select="copy-of(document)" group-adjacent="front/group/@n" for the grouping.
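Put together, the streamable group-adjacent variant might look like the sketch below. It is untested against your real corpus and assumes a processor that actually supports streaming (for example Saxon-EE), that the input really is pre-sorted by group, and that every document has exactly one front/group/@n, since group-adjacent requires a single key value per item.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <!-- Identity transformation as the default, declared streamable -->
  <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="corpus">
    <xsl:copy>
      <!-- copy-of() turns each document into an in-memory subtree, so the
           grouping key and the group members can be navigated freely -->
      <xsl:for-each-group select="copy-of(document)"
                          group-adjacent="front/group/@n">
        <!-- A new chapter starts whenever the key of adjacent documents changes -->
        <chapter n="{current-grouping-key()}">
          <xsl:apply-templates select="current-group()"/>
        </chapter>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Drop the group element, as its value now lives on the chapter -->
  <xsl:template match="front/group"/>

</xsl:stylesheet>
```

Because group-adjacent only compares neighbouring documents, an input that is not sorted by group would produce several chapter elements with the same n value, whereas group-by would merge them into one.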