How to select elements containing special characters using XSL?


I have an ASCII-encoded XML file in which the various special characters are written as hexadecimal character references (&#x..;). Here is a simplified example:

<?xml version="1.0" encoding="ascii"?>
<data>
    <element1>Some regular text</element1>
    <element2>Text containing special characters: 1&#xba;-2&#xaa;</element2>
    <element3>Again regular text, but with the special character prefix: #x</element3>
</data>

What I want to do is select all the leaf elements that contain special characters. The output should look like this:

The following elements in the input file contain special characters:
<element2>Text containing special characters: 1&#xba;-2&#xaa;</element2>

I tried with this XSL:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:output omit-xml-declaration="yes"/>
    <xsl:template match="/">
        <xsl:text>The following elements in the input file contain special characters:
        </xsl:text>
        <xsl:for-each select="//*">
            <xsl:if test="not(*) and contains(., '&amp;#x')">
                <xsl:copy-of select="."></xsl:copy-of>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

But it only gives me:

The following elements in the input file contain special characters:

If I try to search for just "#x" with this XSL:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:output omit-xml-declaration="yes"/>
    <xsl:template match="/">
        <xsl:text>The following elements in the input file contain special characters:
        </xsl:text>
        <xsl:for-each select="//*">
            <xsl:if test="not(*) and contains(., '#x')">
                <xsl:copy-of select="."></xsl:copy-of>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

I get:

The following elements in the input file contain special characters:
        <element3>Again regular text, but with the special character prefix: #x</element3>

So the question is: is there any way to find those elements which contain special characters encoded as "&#x..;"?

I know I can do this with grep etc:

grep '&#x' simpletest.xml
    <element2>Text containing special characters: 1&#xba;-2&#xaa;</element2>

but the ultimate goal is to generate nicely formatted output, including information about parent elements etc., that can be sent as an email notification, and using XSLT would make that part much easier.


Solution

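  • In XSLT/XPath you cannot tell whether a Unicode character appeared literally in the input document or as a character reference, because the XML parser resolves character references before the stylesheet ever sees the data. That is why searching for '&amp;#x' finds nothing. In XSLT 2 or 3, however, you can use matches() with Unicode block escapes to check whether certain characters occur, e.g. \P{IsBasicLatin} for anything outside Basic Latin (ASCII):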

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
        <xsl:output omit-xml-declaration="yes"/>
        <xsl:template match="/">
            <xsl:text>The following elements in the input file contain special characters:
            </xsl:text>
            <xsl:for-each select="//*[not(*) and matches(., '\P{IsBasicLatin}')]">
                <xsl:copy-of select="."></xsl:copy-of>
            </xsl:for-each>
        </xsl:template>
    </xsl:stylesheet>
    

    Output:

    The following elements in the input file contain special characters:
        <element2>Text containing special characters: 1º-2ª</element2>
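
    Since the stated goal is a report with information about parent elements, the same test can be combined with the ancestor-or-self axis. The following is only a minimal sketch, assuming an XSLT 2.0 or 3.0 processor (for example Saxon) and assuming plain-text output is acceptable for the notification; the "path: text" line layout is an illustrative choice, not something taken from the question:

    ```xml
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
        <!-- Assumption: plain text is wanted, so no markup is serialized -->
        <xsl:output method="text"/>
        <xsl:template match="/">
            <xsl:text>The following elements in the input file contain special characters:&#10;</xsl:text>
            <!-- Leaf elements with at least one character outside Basic Latin -->
            <xsl:for-each select="//*[not(*) and matches(., '\P{IsBasicLatin}')]">
                <!-- Element names from the root down to the current element -->
                <xsl:value-of select="string-join(ancestor-or-self::*/name(), '/')"/>
                <xsl:text>: </xsl:text>
                <!-- String value of the element; references are already resolved -->
                <xsl:value-of select="."/>
                <xsl:text>&#10;</xsl:text>
            </xsl:for-each>
        </xsl:template>
    </xsl:stylesheet>
    ```

    For the sample input this should list element2 with the path data/element2, followed by its text with the characters º and ª written out literally rather than as &#x..; references.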