
About JAXP, XSLT and XML reserved characters


It looks like JAXP allows assigning any value to a document node, including <, >, & and other reserved characters. Playing with XML reserved characters and XSLT raised a question. Consider the following code:

DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();

...

Element field = doc.createElement("col");
field.setTextContent( "<p>&]]" );
row.appendChild( field );

...

TransformerFactory factory = TransformerFactory.newInstance();
Source xslt = new StreamSource(new File("templateName.xsl"));
Transformer transformer = factory.newTransformer(xslt);

transformer.transform( new DOMSource(doc), new StreamResult(printer) );

Now, if we have

<xsl:value-of select="col" disable-output-escaping="yes"/>

in "templateName.xsl", the output will look like

"<p>&]]"

and if we have this

<xsl:value-of select="col"/>

the output will be

&lt;p&gt;&amp;]]

So basically my question is: what kind of internal data representation does JAXP use such that this

"<p>&]]"

is OK? It cannot be a text node, and it cannot be a CDATA node either. What is it? I believe a valid XML document must be supplied for a transformation. On the other hand, the disable-output-escaping attribute indicates that special characters should be output as-is. Does that mean our "col" node is kept exactly as written in the code? How can the XML document be valid then?


Solution

  • OK, I think I've figured out how it works. XML reserved characters need escaping only in the serialized (textual) form of a document, and only outside a CDATA section. A DOM text node holds the raw characters, so "<p>&]]" is a perfectly legal text node value; it is never parsed as markup. Escaping happens when a tree is written out: by default the serializer behind the Transformer turns the text into "&lt;p&gt;&amp;]]", and disable-output-escaping="yes" tells it to skip that step and write the characters as-is (which can make the output ill-formed). XSLT does not distinguish CDATA sections from ordinary text in its input, so the same applies to both. Also, xsl:value-of outputs only the string value of the selected node, so the tags of any child elements are dropped. So nothing is escaped before the transformation and nothing gets "undone" afterwards; the attribute simply switches off escaping at output time, which resolves my confusion.

    Anyway, thanks to Michael for the explanation!
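
A minimal sketch for checking where the escaping actually happens, assuming the JDK's built-in JAXP implementation. The class name EscapingDemo, the inline stylesheet, the "row" root element and the "out" wrapper element are made up for illustration; only the "col" element and its text come from the question. Note that disable-output-escaping is an optional feature in XSLT 1.0, so other processors may ignore it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class EscapingDemo {

    // Hypothetical inline stylesheet; doe must be "yes" or "no".
    public static String stylesheet(String doe) {
        return "<xsl:stylesheet version=\"1.0\""
             + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
             + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
             + "<xsl:template match=\"/row\">"
             + "<out><xsl:value-of select=\"col\""
             + " disable-output-escaping=\"" + doe + "\"/></out>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Builds <row><col>text</col></row> in memory; no escaping is involved here.
    public static Document buildDoc(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element row = doc.createElement("row");
        doc.appendChild(row);
        Element col = doc.createElement("col");
        col.setTextContent(text);
        row.appendChild(col);
        return doc;
    }

    // Serializes the DOM with an identity transform; escaping is applied here.
    public static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Runs the stylesheet with escaping either enabled ("no") or disabled ("yes").
    public static String transform(Document doc, String doe) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet(doe))));
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildDoc("<p>&]]");
        // The text node stores the raw characters: prints <p>&]]
        System.out.println(doc.getDocumentElement().getTextContent());
        // Serialized form of the source tree (escaped by the serializer)
        System.out.println(serialize(doc));
        // Result with default escaping, then with escaping disabled
        System.out.println(transform(doc, "no"));
        System.out.println(transform(doc, "yes"));
    }
}
```

With the JDK's default processor, the last two lines should correspond to the two outputs shown in the question (wrapped in the "out" element): escaped by default, raw with disable-output-escaping="yes". The second line shows that the in-memory document serializes to well-formed XML, which is why it counts as a valid input for the transformation.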