Splitting mixed-content nodes on a regex match with XSLT 3


My simplified input looks like this:

<stuff>
    <p>CAPITALWORD is part of <i>mixed</i> content.</p>
    <p>ANOTHER is <i>here</i> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
        paragraph. SOMETIMES even <i>multiple times.</i></p>
</stuff>

My goal is to split each paragraph at every all-caps word. I thought I would group the child nodes, starting a new group at each text node that contains at least two consecutive capital letters, like this:

<xsl:output method="xml" indent="true"></xsl:output>
<xsl:mode on-no-match="shallow-copy"/>
    
<xsl:template match="p">
  <xsl:for-each-group select="node()" group-starting-with="text()[matches(., '[A-Z]{2,}')]">
    <xsl:element name="p" >
      <xsl:apply-templates select="current-group()"/>
    </xsl:element>  
  </xsl:for-each-group>
</xsl:template>

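but this doesn't work, because I'm dealing with mixed content rather than plain strings: a new group can only start at a text-node boundary, not at a capitalized word in the middle of a text node. So I get this: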

<stuff>
   <p>CAPITALWORD is part of <i>mixed</i> content.</p>
   <p>ANOTHER is <i>here</i>
   </p>
   <p> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
   </p>
   <p>
        paragraph. SOMETIMES even <i>multiple times.</i>
   </p>
</stuff>

instead of the desired output:

<stuff>
    <p>CAPITALWORD is part of <i>mixed</i> content. </p>
    <p>ANOTHER is <i>here</i> but it's not the only one. </p>
    <p>SOMEWORDS are <i>mixed up</i> in the <i>same</i> paragraph. </p>
    <p>SOMETIMES even <i>multiple times.</i></p>
</stuff>

I will be most grateful for tips on how to achieve the desired output.


Solution

  • One approach is a two-step transformation: the first step uses analyze-string on the text nodes to wrap each capitalized word in an element; the second step can then simply use group-starting-with on those wrapper elements:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:fn="http://www.w3.org/2005/xpath-functions"
      exclude-result-prefixes="#all"
      expand-text="yes">
    
      <xsl:mode on-no-match="shallow-copy"/>
      
      <!-- Split each p: mark up the capitalized words first, then start a new p at every marker -->
      <xsl:template match="p">
        <xsl:variable name="capitalized-marked-up" as="node()*">
          <xsl:apply-templates mode="markup-capitalized"/>
        </xsl:variable>
        <xsl:for-each-group select="$capitalized-marked-up" group-starting-with="capitalized-word">
          <p>
            <xsl:apply-templates select="current-group()"/>
          </p>
        </xsl:for-each-group>
      </xsl:template>
      
      <!-- Unwrap the temporary marker element in the final output -->
      <xsl:template match="capitalized-word">
        <xsl:apply-templates/>
      </xsl:template>
      
      <xsl:mode name="markup-capitalized" on-no-match="shallow-copy"/>
      
      <!-- Step 1: run the regex over every text node, including those nested in child elements -->
      <xsl:template mode="markup-capitalized" match="text()">
        <xsl:apply-templates select="analyze-string(., '\p{Lu}{2,}')" mode="wrap"/>
      </xsl:template>
      
      <!-- Wrap each regex match; fn:non-match text is copied by the built-in rules of the wrap mode -->
      <xsl:template mode="wrap" match="fn:match">
        <capitalized-word>{.}</capitalized-word>
      </xsl:template>
    
      <xsl:output indent="yes"/>
    
    </xsl:stylesheet>
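
    As a rough illustration of the first step (worked out by hand, not output quoted from a run), take the text node " but it's not the only one. SOMEWORDS are " from the second paragraph. For that node, analyze-string should return a structure along these lines; the indentation is added here for readability, the real result has no whitespace between the elements:

    <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
      <fn:non-match> but it's not the only one. </fn:non-match>
      <fn:match>SOMEWORDS</fn:match>
      <fn:non-match> are </fn:non-match>
    </fn:analyze-string-result>

    The wrap mode then turns this into a text node, a capitalized-word element and another text node:

     but it's not the only one. <capitalized-word>SOMEWORDS</capitalized-word> are 

    Because capitalized-word is now an item of its own in $capitalized-marked-up, next to the i elements and the remaining text nodes, group-starting-with can match it directly. Note that any nodes preceding the first capitalized-word in a paragraph form a group of their own, so a paragraph that starts with lowercase text would yield an extra leading p for that text.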