xml xslt cdata html-escape-characters

XSL unescape HTML inside CDATA


I'm trying to transform XML:

<catalog>
  <country><![CDATA[ WIN8 &lt;b&gt;X&lt;/b&gt; Mac OS ]]></country>
</catalog>

into

<catalog>
  <country><![CDATA[  WIN8 <b>X</b> Mac OS ]]></country>
</catalog>

with an XSL transform.

I know that disable-output-escaping="yes" or cdata-section-elements can turn escaped characters into unescaped ones and put them inside CDATA, but this does not work if the characters are already inside CDATA.

Is there a simple way to do this? Thanks.


Solution

  • This

    <catalog>
      <country><![CDATA[  WIN8 <b>X</b> Mac OS ]]></country>        
    </catalog>
    

    is equivalent to

    <catalog>
      <country> WIN8 &lt;b&gt;X&lt;/b&gt; Mac OS </country>
    </catalog>
    

    which is exactly what you get when using

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output omit-xml-declaration="yes" indent="yes" />
    
      <xsl:template match="node() | @*">
        <xsl:copy>
          <xsl:apply-templates select="node() | @*" />
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="country/text()">
        <xsl:value-of select="." disable-output-escaping="yes" />
      </xsl:template>
    </xsl:stylesheet>
    

    The point is that disable-output-escaping (DOE) has no effect on text written into an element listed in cdata-section-elements (CSE). That's because both directives disable output escaping.

    The text value " WIN8 <b>X</b> Mac OS " becomes:

    • when serialized normally: WIN8 &lt;b&gt;X&lt;/b&gt; Mac OS

    • when serialized with CSE: <![CDATA[ WIN8 <b>X</b> Mac OS ]]>

    • when serialized with DOE: WIN8 <b>X</b> Mac OS

    Note how the last two renderings are exactly the same, except for the enclosing <![CDATA[ ... ]]>.
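
    To see the three renderings side by side, a minimal sketch like the following may help. The element names `demo`, `normal`, `cse` and `doe` and the variable `text` are made up for illustration; any well-formed XML document can serve as input. Keep in mind that DOE is an optional feature, so a processor is allowed to ignore it.

    ```xml
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Only the (hypothetical) <cse> element is listed in cdata-section-elements -->
      <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"
                  cdata-section-elements="cse" />

      <!-- After the stylesheet is parsed, this string is " WIN8 <b>X</b> Mac OS " -->
      <xsl:variable name="text" select="' WIN8 &lt;b&gt;X&lt;/b&gt; Mac OS '" />

      <xsl:template match="/">
        <demo>
          <!-- serialized normally: markup characters are escaped -->
          <normal><xsl:value-of select="$text" /></normal>
          <!-- serialized with CSE: wrapped in a CDATA section, not escaped -->
          <cse><xsl:value-of select="$text" /></cse>
          <!-- serialized with DOE: written as-is, if the processor supports DOE -->
          <doe><xsl:value-of select="$text" disable-output-escaping="yes" /></doe>
        </demo>
      </xsl:template>
    </xsl:stylesheet>
    ```

    On a processor that honors DOE, the content of `<normal>`, `<cse>` and `<doe>` should correspond to the three renderings listed above, in that order.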

    CSE disables output escaping for the text node children of an element and, in exchange, encloses them in <![CDATA[ ... ]]> markers to make up for the lost level of escaping.

    If you additionally set DOE on an <xsl:value-of> that outputs text into an element that has CSE set, nothing happens. Output escaping is already disabled.

    Therefore this

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output omit-xml-declaration="yes" indent="yes" />
      <xsl:output cdata-section-elements="country" />
    
      <xsl:template match="node() | @*">
        <xsl:copy>
          <xsl:apply-templates select="node() | @*" />
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="country/text()">
        <xsl:value-of select="." disable-output-escaping="yes" />
      </xsl:template>
    </xsl:stylesheet>
    

    will give you exactly what your input was.

    That's why you cannot get rid of the double escaping and have CDATA in the same transformation. You could use a two-step approach (the first step disables output escaping, the second step adds CDATA back) if you positively must have CDATA in the result document, but personally I think it's not worth it.
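
    If you do want to try the two-step route, the second step could look roughly like the sketch below. It assumes the first step is the DOE-only stylesheet shown at the top of this answer (the one without cdata-section-elements), and that its result is serialized and parsed again before the second step runs. DOE only takes effect at serialization, so chaining the two steps on an in-memory tree would most likely not work.

    ```xml
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Step 2: write the text children of <country> as CDATA sections -->
      <xsl:output omit-xml-declaration="yes" indent="yes"
                  cdata-section-elements="country" />

      <!-- Identity template: copy everything else unchanged -->
      <xsl:template match="node() | @*">
        <xsl:copy>
          <xsl:apply-templates select="node() | @*" />
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>
    ```

    After step 1, the re-parsed text value of `<country>` is " WIN8 <b>X</b> Mac OS ", so step 2 only has to wrap it in a CDATA section. No DOE is needed in this step.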