xslt, xslt-2.0, xslt-3.0

XSLT to select and transform node (with regex match) and following siblings until next similar node


Somewhat simplified, my XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<dict>
    <entry>
        <form>word</form>
        <gram>noun</gram>
        <span style="bold">1.</span>
        <def>this is a definition in the first sense.</def> – <cit type="example">
            <quote>This is a <span style="bold">quote</span> for the first sense. </quote>
        </cit>
        <span style="bold">2.</span>
        <def>This is a definition for the second sense</def> – <cit type="example">
            <quote>This is a quote for the second sense.</quote>
        </cit>
    </entry>    
</dict>

I need to transform this using XSLT 2.0 or 3.0 to get the following:

<?xml version="1.0" encoding="UTF-8"?>
<dict>
    <entry>
        <form>word</form>
        <gram>noun</gram>
        <sense n="1">
            <def>this is a definition in the first sense.</def> – <cit type="example">
                <quote>This is a <span style="bold">quote</span> for the first sense. </quote>
            </cit>
        </sense>
        <sense n="2">
            <def>This is a definition for the second sense</def> – <cit type="example">
                <quote>This is a quote for the second sense.</quote>
            </cit>
        </sense>
    </entry>
</dict>

There can be more than two senses, and a span with style bold can occur elsewhere, so the sense markers need to be identified specifically, with something like tei:span[@style='bold'][matches(text(), '^\d\.')].

I'm having a hard time putting this together in a stylesheet that would also extract the number from the span's text node and use it as the attribute value of the new element <sense>.

I'll be most grateful for your tips.


Solution

  • Here is an XSLT 3.0 sample:

    <xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    
        <xsl:mode on-no-match="shallow-copy"/>
    
        <xsl:output indent="yes"/>
    
        <xsl:template match="entry">
            <xsl:copy>
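                <!-- Start a new group at every bold span whose text is a number followed by a dot -->
                <xsl:for-each-group select="node()" group-starting-with="span[@style = 'bold'][matches(., '^[0-9]+\.$')]">
                    <xsl:choose>
                        <!-- The group begins with a sense marker: wrap the rest of the group in sense -->
                        <xsl:when test="self::span[@style = 'bold'][matches(., '^[0-9]+\.$')]">
                            <!-- Keep only the digits of the marker as the attribute value -->
                            <sense nr="{replace(., '[^0-9]+', '')}">
                                <xsl:apply-templates select="current-group() except ."/>
                            </sense>
                        </xsl:when>
                        <!-- Leading nodes before the first marker (form, gram) are copied unchanged -->
                        <xsl:otherwise>
                            <xsl:apply-templates select="current-group()"/>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:for-each-group>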
            </xsl:copy>
        </xsl:template>
    
    </xsl:stylesheet>
    

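    producing the following output (the sample names the attribute nr; use n in the literal result element instead if the result must match the requested output exactly):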

    <?xml version="1.0" encoding="UTF-8"?>
    <dict>
        <entry>
            <form>word</form>
            <gram>noun</gram>
            <sense nr="1">
            <def>this is a definition in the first sense.</def> – <cit type="example">
                <quote>This is a <span style="bold">quote</span> for the first sense. </quote>
            </cit>
            </sense>
            <sense nr="2">
            <def>This is a definition for the second sense</def> – <cit type="example">
                <quote>This is a quote for the second sense.</quote>
            </cit>
        </sense>
        </entry>    
    </dict>
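
    Since the question also allows XSLT 2.0, here is a minimal sketch of the same grouping for a 2.0 processor. It assumes the input is in no namespace, as in the posted sample. The xsl:mode declaration is XSLT 3.0 only, so an explicit identity template stands in for it. The attribute is named n here to match the requested output, and copying the attributes of entry is an addition that the sample above does not make.

    ```xml
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

        <xsl:output indent="yes"/>

        <!-- Identity template: replaces xsl:mode on-no-match="shallow-copy" from the 3.0 sample -->
        <xsl:template match="@* | node()">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()"/>
            </xsl:copy>
        </xsl:template>

        <xsl:template match="entry">
            <xsl:copy>
                <!-- Assumption: keep any attributes that entry carries in the real data -->
                <xsl:apply-templates select="@*"/>
                <!-- A new group starts at each bold span that holds digits followed by a dot -->
                <xsl:for-each-group select="node()"
                    group-starting-with="span[@style = 'bold'][matches(., '^[0-9]+\.$')]">
                    <xsl:choose>
                        <xsl:when test="self::span[@style = 'bold'][matches(., '^[0-9]+\.$')]">
                            <!-- Strip everything but the digits to get the sense number -->
                            <sense n="{replace(., '[^0-9]+', '')}">
                                <xsl:apply-templates select="current-group() except ."/>
                            </sense>
                        </xsl:when>
                        <xsl:otherwise>
                            <!-- Nodes before the first marker are processed unchanged -->
                            <xsl:apply-templates select="current-group()"/>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:for-each-group>
            </xsl:copy>
        </xsl:template>

    </xsl:stylesheet>
    ```

    If the real data is in the TEI namespace, as the tei:span in the question suggests, the unprefixed names above will not match. In that case, adding xpath-default-namespace="http://www.tei-c.org/ns/1.0" to xsl:stylesheet, together with xmlns="http://www.tei-c.org/ns/1.0" so that the new sense element lands in the same namespace, should be enough.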