Tags: xml, xslt, unicode, xslt-2.0

XSLT - Transform unicode characters


I have an XML document like this:

<doc>
    <?PIValue  &#x00D2;&#x00D3;&#x00D4;&#x00D5;&#x00D6;&#x00D8; &#x00C0;&#x00C1;&#x00C2;&#x00C3;&#x00C4;&#x00C5;?>
    <p>&#x00D2;&#x00D3;&#x00D4;&#x00D5;&#x00D6;&#x00D8; &#x00C0;&#x00C1;&#x00C2;&#x00C3;&#x00C4;&#x00C5;</p>
</doc>

I have an XSLT transform for this XML, as follows:

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="doc">
        <doc>
            <xsl:apply-templates/>
            <p2><xsl:value-of select="processing-instruction('PIValue')"/></p2>
        </doc>
    </xsl:template>

    <xsl:template match="p">
        <p1>
            <xsl:apply-templates/>
        </p1>
    </xsl:template>

The output from the above transform is this:

<doc>
    <?PIValue &#x00D2;&#x00D3;&#x00D4;&#x00D5;&#x00D6;&#x00D8; &#x00C0;&#x00C1;&#x00C2;&#x00C3;&#x00C4;&#x00C5;?>
    <p1>ÒÓÔÕÖØ ÀÁÂÃÄÅ</p1>
    <p2>&amp;#x00D2;&amp;#x00D3;&amp;#x00D4;&amp;#x00D5;&amp;#x00D6;&amp;#x00D8; &amp;#x00C0;&amp;#x00C1;&amp;#x00C2;&amp;#x00C3;&amp;#x00C4;&amp;#x00C5;</p2>
</doc>

As you can see, the character references inside the <p> element appear as the actual characters in the output (inside the <p1> element). But the same character references inside the processing instruction are not converted to their characters in the output (inside the <p2> element); they come out as literal, escaped text.

How can I change my transform so that the text taken from the processing instruction is also shown as characters in the <p2> element?

Expected output:

<doc>
    <?PIValue &#x00D2;&#x00D3;&#x00D4;&#x00D5;&#x00D6;&#x00D8; &#x00C0;&#x00C1;&#x00C2;&#x00C3;&#x00C4;&#x00C5;?>
    <p1>ÒÓÔÕÖØ ÀÁÂÃÄÅ</p1>
    <p2>ÒÓÔÕÖØ ÀÁÂÃÄÅ</p2>
</doc>

Solution

  • In XML, character references (like &#xd2;) are recognized in element and attribute content, but not in processing instructions or comments. So in your processing instruction the string &#x00D2; is just a string of 8 characters, not a reference to the single character xD2.

    If you want to interpret the &#x00D2; strings as character references, you can either submit them to an XML parser (as Martin Honnen suggests) or parse them "by hand" in your own code. It's not that difficult: xsl:analyze-string will extract the '00D2' part, writing a recursive function to convert hex to an integer is fairly straightforward, and the final step is to call codepoints-to-string to convert the integer codepoint to a character (a string of length one).
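
    The "by hand" approach could be sketched roughly as follows in XSLT 2.0. This is only a minimal sketch, not a tested drop-in: the namespace urn:example:local-functions and the function names f:hex-to-int and f:decode-char-refs are made up for illustration, and it only handles hexadecimal references of the form &#x...; (decimal references such as &#210; would need a second regex branch). The identity and p templates are copied from the question.

    ```xslt
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:f="urn:example:local-functions"
        exclude-result-prefixes="xs f">

        <!-- Hypothetical helper: convert a hex string such as '00D2' to an
             integer by recursing on everything except the last digit. -->
        <xsl:function name="f:hex-to-int" as="xs:integer">
            <xsl:param name="hex" as="xs:string"/>
            <xsl:variable name="len" select="string-length($hex)"/>
            <xsl:sequence select="
                if ($len = 0) then 0
                else 16 * f:hex-to-int(substring($hex, 1, $len - 1))
                     + string-length(substring-before('0123456789ABCDEF',
                           upper-case(substring($hex, $len, 1))))"/>
        </xsl:function>

        <!-- Hypothetical helper: replace each literal &#x...; sequence in a
             string with the character it names; other text passes through. -->
        <xsl:function name="f:decode-char-refs" as="xs:string">
            <xsl:param name="text" as="xs:string"/>
            <xsl:variable name="parts" as="xs:string*">
                <xsl:analyze-string select="$text" regex="&amp;#x([0-9A-Fa-f]+);">
                    <xsl:matching-substring>
                        <xsl:sequence select="codepoints-to-string(f:hex-to-int(regex-group(1)))"/>
                    </xsl:matching-substring>
                    <xsl:non-matching-substring>
                        <xsl:sequence select="."/>
                    </xsl:non-matching-substring>
                </xsl:analyze-string>
            </xsl:variable>
            <xsl:sequence select="string-join($parts, '')"/>
        </xsl:function>

        <!-- Identity template, as in the question. -->
        <xsl:template match="node()|@*">
            <xsl:copy>
                <xsl:apply-templates select="node()|@*"/>
            </xsl:copy>
        </xsl:template>

        <xsl:template match="doc">
            <doc>
                <xsl:apply-templates/>
                <!-- [1] guards against several PIValue instructions;
                     string() turns a missing one into ''. -->
                <p2><xsl:value-of select="f:decode-char-refs(string(processing-instruction('PIValue')[1]))"/></p2>
            </doc>
        </xsl:template>

        <xsl:template match="p">
            <p1>
                <xsl:apply-templates/>
            </p1>
        </xsl:template>

    </xsl:stylesheet>
    ```

    With the sample input from the question, <p2> should then contain the same characters as <p1>, while the copied processing instruction keeps its literal &#x...; text. Note that the regex attribute is an attribute value template, so any curly braces added to the pattern (for example a {4} quantifier) would have to be doubled.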