xml xslt xslt-2.0

How to have internal element of text node include text after element


The authors of an XML document did not include all of the text inside an element that will be converted to a hyperlink. I would like to process or pre-process the XML so that the element includes the necessary text. I find this hard to describe, but a simple example should show what I'm attempting. I'm using XSLT 2.0. I already do regular expression processing for various situations but can't figure this one out.

I know how to do this with Perl/Python regular expressions, but I can't figure out how to approach it with XSLT.

Here is a very simplified XML sample from an author, in which they left the ' (Sheet 3)' out of the glink element:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <para>
        Go look at figure <glink refid="1">Figure 22</glink> (Sheet 3). Then go do something else.
    </para>
</root>

Here is what I'd like it to convert to where the ' (Sheet 3)' is now inside the glink tag:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <para>
        Go look at figure <glink refid="1">Figure 22 (Sheet 3)</glink>. Then go do something else.
    </para>
</root>

The case when this conversion should happen is when there is a glink element followed by (this regular expression):

\s\(Sheet \d\)

I currently have two XSLT stylesheets. The first pre-processes the XML to handle a number of other situations (using regular expressions with xsl:analyze-string). The second converts the pre-processed XML to HTML. It has a template that turns glink elements into hyperlinks, but each hyperlink should include the Sheet information.

I would assume that it is easier to pre-process this first and leave the 2nd XSLT alone, but I always appreciate better ways.
Thank you for your time.


Solution

  • In order to reduce the use of regex functions, I would use this approach:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
      <xsl:template match="node()|@*">
        <xsl:copy>
          <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="glink">
        <xsl:variable name="vAnalyzedString">
            <!-- string() guarantees a single string, even when no text node follows the glink -->
            <xsl:analyze-string 
                select="string(following-sibling::node()[1][self::text()])"
                regex="^\s*\(Sheet\s+\d+\)">
                <xsl:matching-substring>
                    <match>
                        <xsl:value-of select="."/>
                    </match>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                    <no-match>
                        <xsl:value-of select="."/>
                    </no-match>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <xsl:copy>
          <xsl:apply-templates select="node()|@*"/>
          <xsl:apply-templates 
            select="$vAnalyzedString/match/text()"/>
        </xsl:copy>
        <xsl:apply-templates 
            select="$vAnalyzedString/no-match/text()"/>
      </xsl:template>
    
      <xsl:template match="text()[preceding-sibling::node()[1][self::glink]]"/>
    </xsl:stylesheet>
    

    Output:

    <root>
       <para>
            Go look at figure <glink refid="1">Figure 22 (Sheet 3)</glink>. Then go do something else.
        </para>
    </root>
    

    Note: every glink element is processed, and the last template suppresses the text node that immediately follows a glink, because the glink template has already emitted that text (the matching part inside the copied element, the rest after it). Using the xsl:analyze-string instruction means declaring a variable to hold the partial results and then navigating them. In return, this approach makes it easy to process those (now separate) text nodes further, and it runs the regex only once.
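
    For comparison, here is a minimal sketch of a variant that avoids the variable of partial results by calling the XPath 2.0 functions matches() and replace() directly. It is only an illustration of the trade-off, not a recommendation: the regex is evaluated up to three times per glink instead of once. The variable names vSheet and vNext are made up for this sketch, and the pattern is assumed to be the same one used in the stylesheet above.

    ```xml
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
      <!-- Assumed pattern: same regex as in the stylesheet above -->
      <xsl:variable name="vSheet" select="'^\s*\(Sheet\s+\d+\)'"/>

      <!-- Identity template: copy everything that is not handled below -->
      <xsl:template match="node()|@*">
        <xsl:copy>
          <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
      </xsl:template>

      <xsl:template match="glink">
        <!-- Text immediately after the glink, or '' if there is none -->
        <xsl:variable name="vNext"
            select="string(following-sibling::node()[1][self::text()])"/>
        <xsl:copy>
          <xsl:apply-templates select="node()|@*"/>
          <xsl:if test="matches($vNext, $vSheet)">
            <!-- Keep only the leading ' (Sheet N)' part; the 's' flag lets '.' span line breaks -->
            <xsl:value-of
                select="replace($vNext, concat('(', $vSheet, ').*'), '$1', 's')"/>
          </xsl:if>
        </xsl:copy>
      </xsl:template>

      <!-- Text right after a glink: drop the leading ' (Sheet N)' if present -->
      <xsl:template match="text()[preceding-sibling::node()[1][self::glink]]">
        <xsl:value-of select="replace(., $vSheet, '')"/>
      </xsl:template>
    </xsl:stylesheet>
    ```

    Applied to the sample input from the question, this sketch should also move ' (Sheet 3)' inside the glink element and leave '. Then go do something else.' after it. The version with xsl:analyze-string remains the better fit when the matched and unmatched parts need further processing.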