xml xslt xsl-fo

XSL creating 'chapters' or 'groups' from similar tagged entries


I have a large XML corpus document which has a structure that looks generally like the following:

<corpus>
   <document n="001">
       <front>
          <title>foo title</title>
          <group n="foo_group_A"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
   <document n="002">
       <front>
          <title>foo title</title>
          <group n="foo_group_A"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
   <document n="003">
       <front>
          <title>foo title</title>
          <group n="foo_group_A"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
   <document n="004">
       <front>
          <title>foo title</title>
          <group n="foo_group_B"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
       </body>
   </document>
   <document n="005">
       <front>
          <title>foo title</title>
          <group n="foo_group_B"/>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
    [...]
</corpus>

I am pre-processing this XML file into a different XML format using XSLT 3.0 before finally outputting to PDF. As part of the transformation, I want to collect the <document> elements and wrap them in a new <chapter> element that reflects the value of front/group/@n. The new corpus would look like the following, where the group/@n value determines which chapter each document falls under:

<corpus>
  <chapter n="foo_group_A">
   <document n="001">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
   <document n="002">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
   <document n="003">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
          <seg n="3">some text with markups</seg>
       </body>
   </document>
  </chapter>
  <chapter n="foo_group_B">
   <document n="004">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
       </body>
   </document>
   <document n="005">
       <front>
          <title>foo title</title>
       </front>
       <body>
          <seg n="1">some text with markups</seg>
          <seg n="2">some text with markups</seg>
       </body>
   </document>
  </chapter>
    [...]
</corpus>

The file is already pre-sorted by group (foo_group_A, foo_group_B, etc.), so no extra sorting is necessary. It just requires creating a new <chapter> element to contain the relevant documents. I've tried this with xsl:for-each, but I think I'm missing some sort of 'summary' or 'collection' of groups through which to iterate.

Many thanks in advance.


Solution

  • In XSLT 3, grouping is done with xsl:for-each-group rather than xsl:for-each, for example:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        exclude-result-prefixes="xs"
        version="3.0">
    
      <xsl:mode on-no-match="shallow-copy"/>
    
      <xsl:output method="xml" indent="yes"/>
      <xsl:strip-space elements="*"/>
    
      <xsl:template match="corpus">
          <xsl:copy>
              <xsl:for-each-group select="document" group-by="front/group/@n">
                  <chapter n="{current-grouping-key()}">
                      <xsl:apply-templates select="current-group()"/>
                  </chapter>
              </xsl:for-each-group>
          </xsl:copy>
      </xsl:template>  
    
      <xsl:template match="front/group"/>
    
    </xsl:stylesheet>
    

    http://xsltfiddle.liberty-development.net/nbUY4ki

    If the documents are already sorted by the grouping key front/group/@n, it should also suffice to use xsl:for-each-group select="document" group-adjacent="front/group/@n" instead of the group-by above. That also makes it easier to use streaming for huge documents: add streamable="yes" to the xsl:mode declaration and use xsl:for-each-group select="copy-of(document)" group-adjacent="front/group/@n" for the grouping.
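
    For reference, the non-streaming group-adjacent variant might look like the sketch below. It is the same stylesheet with only the grouping attribute changed. It assumes that documents sharing a front/group/@n value are contiguous in the input and that every document carries exactly one such attribute. The comments are added for illustration.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="3.0">

      <!-- Copy everything that has no more specific template -->
      <xsl:mode on-no-match="shallow-copy"/>

      <xsl:output method="xml" indent="yes"/>
      <xsl:strip-space elements="*"/>

      <xsl:template match="corpus">
        <xsl:copy>
          <!-- Assumption: documents with the same front/group/@n are adjacent,
               and each document has exactly one front/group/@n -->
          <xsl:for-each-group select="document" group-adjacent="front/group/@n">
            <chapter n="{current-grouping-key()}">
              <xsl:apply-templates select="current-group()"/>
            </chapter>
          </xsl:for-each-group>
        </xsl:copy>
      </xsl:template>

      <!-- Drop the group marker; its value is now carried by chapter/@n -->
      <xsl:template match="front/group"/>

    </xsl:stylesheet>
    ```

    Unlike group-by, group-adjacent starts a new group whenever the key changes, so a group value that reappears later in the file would produce a second chapter with the same n. With input sorted as in the question, both approaches should give the same result.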