XSLT analyze-string: split part of mixed content into a new element


The XML source contains a mixed-content element named paragraph. Most of the time its content starts with a number in brackets, e.g. (1). The number is always at the start of the first text node.

The XML target holds this number in a separate element named counter.

How can paragraph be processed efficiently?

Example number masks

(1)
(0...9)
[0...9]
{:digits:}

Example paragraph source

<paragraphs>
    <paragraph>(1) text <try>1</try> <italic>italic</italic> stuff</paragraph>
    <paragraph>[2] text <try>2</try> <italic>italic</italic> stuff</paragraph>
    <paragraph>{123} text <try>3</try> <italic>italic</italic> stuff</paragraph>
    <paragraph>text <try>4</try> <italic>italic</italic> stuff</paragraph>   
</paragraphs>

Example paragraph target

<paragraphs>    
    <frame>
        <counter>(1)</counter>
        <paragraph>text <try>1</try> <italic>italic</italic> stuff</paragraph>
    </frame>
    <frame>
        <counter>[2]</counter>
        <paragraph>text <try>2</try> <italic>italic</italic> stuff</paragraph>
    </frame>
    <frame>
        <counter>{123}</counter>
        <paragraph>text <try>3</try> <italic>italic</italic> stuff</paragraph>
    </frame>
    <frame>
        <paragraph>text <try>4</try> <italic>italic</italic> stuff</paragraph>
    </frame>
 </paragraphs>

not(functional) part

<xsl:template match="paragraph">
    <frame>
        <xsl:analyze-string select="." regex="(^[^\s]+)"><!-- TODO: select digits instead of the first whitespace! -->
            <xsl:matching-substring>
                <xsl:element name="counter">
                    <xsl:value-of select="regex-group(1)" />
                </xsl:element>
            </xsl:matching-substring>
        </xsl:analyze-string>
        <paragraph>
            <xsl:apply-templates/><!-- TODO: everything but not the part of regex-group(1) + whitespace-character -->
        </paragraph>
    </frame>
</xsl:template>

I stopped working on this template because there may be a better way to tackle this problem.

Any help is appreciated.


Solution

  • If you only need to split the first child node of paragraph, when it is a text node, into the counter and the remaining text, then I think the following does that:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
    
        <xsl:param name="counter-pattern" as="xs:string">^(\([0-9]+\)|\[[0-9]+\]|\{[0-9]+\})</xsl:param>
    
        <xsl:template match="@* | node()" mode="#all">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()" mode="#current"/>
            </xsl:copy>
        </xsl:template>
    
        <xsl:template match="paragraph">
            <frame>
                <xsl:apply-templates select="." mode="counter"/>
            </frame>
        </xsl:template>
    
        <xsl:template match="paragraph[node()[1][self::text()[matches(., $counter-pattern)]]]"
            mode="counter">
            <xsl:variable name="components" as="xs:string*">
                <xsl:analyze-string select="node()[1]" regex="{$counter-pattern}">
                    <xsl:matching-substring>
                        <xsl:sequence select="."/>
                    </xsl:matching-substring>
                    <xsl:non-matching-substring>
                        <xsl:sequence select="."/>
                    </xsl:non-matching-substring>
                </xsl:analyze-string>
            </xsl:variable>
            <counter>
                <xsl:value-of select="$components[1]"/>
            </counter>
            <xsl:copy>
                <xsl:value-of select="$components[2]"/>
                <xsl:apply-templates select="node()[position() gt 1]"/>
            </xsl:copy>
        </xsl:template>
    
    </xsl:stylesheet>
    

    You might want to use <xsl:value-of select="replace($components[2], '^\s+', '')"/> instead of <xsl:value-of select="$components[2]"/> if the white space between the counter and the following text is not supposed to show up in the paragraph.
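
    As an alternative to the replace() call, the counter and the remaining text can be captured in one pass with regex-group(). The following is only a sketch under a few assumptions: the pattern keeps the counter in its first capturing group (written here with \([0-9]+\) for the round-bracket case), the suffix \s*(.*)$ and the variable name para are additions made for illustration, and the result element is written as a literal paragraph because xsl:copy cannot be used while the context item is a string.

    ```xml
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">

        <!-- group 1 = the counter, e.g. (1), [2] or {123} -->
        <xsl:param name="counter-pattern" as="xs:string">^(\([0-9]+\)|\[[0-9]+\]|\{[0-9]+\})</xsl:param>

        <!-- identity transformation for everything not handled below -->
        <xsl:template match="@* | node()" mode="#all">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()" mode="#current"/>
            </xsl:copy>
        </xsl:template>

        <xsl:template match="paragraph">
            <frame>
                <xsl:apply-templates select="." mode="counter"/>
            </frame>
        </xsl:template>

        <xsl:template match="paragraph[node()[1][self::text()[matches(., $counter-pattern)]]]"
            mode="counter">
            <!-- inside xsl:matching-substring the context item is a string,
                 so keep a reference to the paragraph element -->
            <xsl:variable name="para" select="."/>
            <!-- assumed suffix: \s* swallows the white space after the counter,
                 (.*)$ captures the rest of the text node as group 2;
                 flags="s" lets . match line breaks as well -->
            <xsl:analyze-string select="node()[1]"
                regex="{concat($counter-pattern, '\s*(.*)$')}" flags="s">
                <xsl:matching-substring>
                    <counter>
                        <xsl:value-of select="regex-group(1)"/>
                    </counter>
                    <paragraph>
                        <xsl:value-of select="regex-group(2)"/>
                        <xsl:apply-templates select="$para/node()[position() gt 1]"/>
                    </paragraph>
                </xsl:matching-substring>
            </xsl:analyze-string>
        </xsl:template>

    </xsl:stylesheet>
    ```

    Paragraphs without a counter do not match the last template and are copied unchanged by the identity template, so they still end up inside a frame.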

    Treat the regular expression as an example; you may need to adapt it to your needs as well.
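
    One possible adaptation, shown only as an illustration: a shorter pattern that uses character classes for the opening and closing bracket. It is a hypothetical variant, not part of the stylesheet above, and it is less strict, because it also accepts mismatched pairs such as (1].

    ```xml
    <!-- Hypothetical, more compact pattern: one of ( [ { followed by digits
         and one of ) ] }. The counter is still the first capturing group.
         Unlike the three explicit alternatives, this also matches
         mismatched pairs such as "(1]" or "{2)". -->
    <xsl:param name="counter-pattern" as="xs:string">^([(\[{][0-9]+[)\]}])</xsl:param>
    ```

    [0-9] is kept here on purpose: in the XPath regular expression syntax \d matches any Unicode decimal digit, which may be broader than wanted.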