Consider this "identity" transform:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output encoding="UTF-8" method="xml" indent="yes" media-type="application/xml"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
and this input XML:
<?xml version="1.0" encoding="UTF-8"?>
<Foobar xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:transform version="2.0">

<!-- Parameters -->
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>

<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>

<xsl:template match="/*">
</xsl:template>

</xsl:transform>
</Foobar>
Why does SaxonJ-HE 11.3 delete the blank lines?
Here's a diff showing what I'm talking about:
$ saxon -xsl:transform.xsl -s:input.xml | diff -u input.xml -
--- input.xml 2022-06-16 16:26:41.000000000 -0400
+++ - 2022-06-16 16:28:42.000000000 -0400
@@ -6,12 +6,9 @@
<xsl:param name="param1"/>
<xsl:param name="param2"/>
<xsl:param name="param3"/>
-
<!-- Variables -->
<xsl:variable name="variable1" select="'abc'"/>
-
<xsl:template match="/*">
</xsl:template>
-
</xsl:transform>
</Foobar>
It's quite challenging to find an indentation algorithm that both (a) preserves existing whitespace in the source document, and (b) produces nice-looking output. For example, consider what happens when a template rule processes all children of an element (both element children and whitespace text node children) with an xsl:sort on an attribute value; if all whitespace from this output sequence is preserved, this will tend to put a massive wadge of whitespace at the start of the output sequence, which looks pretty ugly. This can also happen if you apply-templates to all children, but delete some of the elements while leaving the text nodes unchanged. So the spec allows the processor not only to add whitespace for indentation, but to merge ("elide") this with existing whitespace.
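The sorting scenario can be sketched as a stylesheet. This is only an illustration: the items element and its name attribute are hypothetical, not taken from the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical container element: reorder all of its children by @name. -->
  <xsl:template match="items">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- node() selects the whitespace text nodes as well as the elements.
           A text node has no @name, so its sort key is the empty sequence,
           which sorts before any other value in the default ascending order.
           All the whitespace therefore arrives ahead of the first element. -->
      <xsl:apply-templates select="node()">
        <xsl:sort select="@name"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

  <!-- Identity rule for everything else. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

If the serializer wrote out every one of those whitespace nodes verbatim, the output would open with a run of blank lines followed by the elements packed together, which is the ugly result that eliding whitespace is meant to avoid.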
In particular, it's reasonable to assume that if you get multiple blank lines in the result tree, they weren't put there deliberately, but arrived by accident as a result of copying multiple whitespace nodes from the input.
What's actually happening in this particular case is as follows:
For comments, the rules are different depending on whether the comment follows a start tag or an end tag. The first comment follows a start tag, and in this case the accumulated whitespace is output as-is, followed by the comment with no further indentation. The second comment follows an end tag (actually an empty element tag), and in this case the comment is indented according to its hierarchic level in the result tree, and any preceding whitespace in the result tree is discarded.
Before a start tag, indentation is added if the start tag immediately follows another start tag or end tag; if it follows a text node, no indentation is added. This rule is designed primarily to make mixed content work properly.
Before an end tag, indentation is added if it follows another end tag, but not if it follows a start tag or character data.
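To see how the start-tag and end-tag rules interact, here is a small hypothetical input (the element names are invented for illustration), written on a single line so that it contains no whitespace text nodes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input: one line, so the serializer has no existing whitespace to reuse. -->
<doc><p>Some <b>bold</b> text</p><list><item>a</item><item>b</item></list></doc>
```

Reading the rules as stated: inside p, the b start tag follows a text node, and both the b and p end tags follow character data, so no indentation is added and the mixed content stays on one line. In list, each item start tag follows either a start tag or an end tag, so it is indented, and the list end tag follows an end tag, so it is indented too. Each item end tag follows character data, so every item stays on a single line. This is a reading of the summary above rather than verified output; the exact result may differ between Saxon versions.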
The detail is a lot more complex, and it has evolved in a fairly ad hoc way to handle a wide variety of circumstances. As a high-level summary, Saxon will in some circumstances output the whitespace that it finds in the result tree, and in other circumstances it will output its own whitespace in preference. The algorithm isn't perfect, but it copes reasonably well with messy situations like when the input is indented with 4 spaces and the output is to be indented with 3.
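If the aim is simply to keep the source's blank lines, one option is to switch indentation off so that the serializer adds and elides nothing. This is a suggestion, not something the explanation above prescribes. It is a minimal sketch that assumes the input's existing whitespace is already what you want and that no whitespace stripping (xsl:strip-space or an equivalent processor option) is in effect:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- indent="no": the serializer adds no whitespace of its own, so the
       whitespace text nodes copied by the identity rule are written unchanged.
       Assumption: nothing strips whitespace before the transform runs. -->
  <xsl:output encoding="UTF-8" method="xml" indent="no" media-type="application/xml"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>
```

With this variant the blank lines between the parameter, variable, and template sections should come through as they appear in the input, though other serialization details may still differ from the source file.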