Splitting mixed-content nodes on a regex match with XSLT 3

My simplified input looks like this:

    <p>CAPITALWORD is part of <i>mixed</i> content.</p>
    <p>ANOTHER is <i>here</i> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
        paragraph. SOMETIMES even <i>multiple times.</i></p>

Now, my goal is to split each paragraph before every all-caps word. I thought I would group the child nodes, starting a new group at each text node that contains at least two consecutive capital letters, like this:

    <xsl:output method="xml" indent="true"/>
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="p">
      <xsl:for-each-group select="node()" group-starting-with="text()[matches(., '[A-Z]{2,}')]">
        <xsl:element name="p">
          <xsl:apply-templates select="current-group()"/>
        </xsl:element>
      </xsl:for-each-group>
    </xsl:template>

But this doesn't work, because I'm dealing with mixed content rather than plain strings: a group can only start at a whole text node, never in the middle of one. So I get this:

    <p>CAPITALWORD is part of <i>mixed</i> content.</p>
    <p>ANOTHER is <i>here</i></p>
    <p> but it's not the only one. SOMEWORDS are <i>mixed up</i> in the <i>same</i>
        paragraph. SOMETIMES even <i>multiple times.</i></p>

instead of the desired output:

    <p>CAPITALWORD is part of <i>mixed</i> content. </p>
    <p>ANOTHER is <i>here</i> but it's not the only one. </p>
    <p>SOMEWORDS are <i>mixed up</i> in the <i>same</i> paragraph. </p>
    <p>SOMETIMES even <i>multiple times.</i></p>

I will be most grateful for tips on how to achieve the desired output.


  • One approach is a two-step transformation. The first step uses analyze-string on the text nodes to wrap each capitalized word in an element; the second step can then simply use group-starting-with on those wrapper elements:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0"
      xmlns:fn="http://www.w3.org/2005/xpath-functions"
      exclude-result-prefixes="#all" expand-text="yes">
      <xsl:mode on-no-match="shallow-copy"/>
      <xsl:template match="p">
        <xsl:variable name="capitalized-marked-up" as="node()*">
          <xsl:apply-templates mode="markup-capitalized"/>
        </xsl:variable>
        <xsl:for-each-group select="$capitalized-marked-up" group-starting-with="capitalized-word">
          <p>
            <xsl:apply-templates select="current-group()"/>
          </p>
        </xsl:for-each-group>
      </xsl:template>
      <xsl:template match="capitalized-word">
        <xsl:apply-templates/>
      </xsl:template>
      <xsl:mode name="markup-capitalized" on-no-match="shallow-copy"/>
      <xsl:template mode="markup-capitalized" match="text()">
        <xsl:apply-templates select="analyze-string(., '\p{Lu}{2,}')" mode="wrap"/>
      </xsl:template>
      <xsl:template mode="wrap" match="fn:match">
        <capitalized-word>{.}</capitalized-word>
      </xsl:template>
      <xsl:output indent="yes"/>
    </xsl:stylesheet>
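
    To make the intermediate step concrete, here is roughly what the first pass should produce for one text node of the second sample paragraph. This is an illustrative sketch rather than captured processor output: the line breaks inside the elements are added for readability, and the second fragment assumes the fn:match template wraps its content in a capitalized-word element, as the grouping pattern implies.

    ```xml
    <!-- Result of analyze-string(" but it's not the only one. SOMEWORDS are ", '\p{Lu}{2,}') -->
    <fn:analyze-string-result xmlns:fn="http://www.w3.org/2005/xpath-functions">
      <fn:non-match> but it's not the only one. </fn:non-match>
      <fn:match>SOMEWORDS</fn:match>
      <fn:non-match> are </fn:non-match>
    </fn:analyze-string-result>

    <!-- Start of the $capitalized-marked-up sequence for that paragraph,
         after the "wrap" mode has replaced each fn:match with a wrapper element -->
    <capitalized-word>ANOTHER</capitalized-word> is <i>here</i> but it's not the only one. <capitalized-word>SOMEWORDS</capitalized-word> are <i>mixed up</i> in the <i>same</i> ...
    ```

    Because every capitalized word is now a top-level element in that sequence, group-starting-with="capitalized-word" can start a new p exactly at each word, and the template matching capitalized-word unwraps the element again when the group is output.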