xml, xsd, removing-whitespace, xml-signature, canonicalization

Canonical XML: whitespace in element-only container?


I have a simple XML file with an XSD schema, where some elements are allowed to contain only certain elements, e.g.

<xsd:element name="day" type="xsd:date"/>
<xsd:element name="interval">
    <xsd:complexType>
        <xsd:sequence>
            <xsd:element ref="day" minOccurs="2" maxOccurs="2"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:element>

and the XML code:

<interval>
    <day>2016-08-21</day>
    <day>2016-10-21</day>
</interval>

If I type anything other than whitespace or day elements inside the interval tags, the document (correctly) fails to validate. Using lxml in Python, I extracted the canonical form (C14N) of this XML and found that the whitespace (those 4 spaces of indentation) was kept, as the standard says it should be.
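The observation above can be reproduced with a minimal sketch. It assumes the standard library's `xml.etree.ElementTree.canonicalize` (C14N 2.0, Python 3.8+) as a stand-in for lxml, so it is not the exact call used in the question; the variable names are illustrative only.

```python
from xml.etree.ElementTree import canonicalize

# The same data, once indented and once written on a single line.
indented = (
    "<interval>\n"
    "    <day>2016-08-21</day>\n"
    "    <day>2016-10-21</day>\n"
    "</interval>"
)
compact = "<interval><day>2016-08-21</day><day>2016-10-21</day></interval>"

# canonicalize() returns the canonical form as a string when no
# output file is given.
c14n_indented = canonicalize(indented)
c14n_compact = canonicalize(compact)

# Whitespace between the elements is retained, so the two canonical
# forms differ and would produce different digests.
print(c14n_indented == c14n_compact)
```

Because the canonical forms differ, a signature computed over one layout would not verify against the other.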

I then need to digitally sign this document, but I do not understand why anyone would sign that whitespace. It seems like a weakness to me: different indentation yields different canonical XML (and therefore mismatching signatures), yet this is an unambiguous case in which the whitespace has nothing to do with the represented data (all the more so because the schema would reject any meaningful content in that position).

  • Why is that whitespace part of a canonical representation of an XML involved in digital signatures?
  • Is there any way of enforcing in the XSD the removal of such useless whitespace?

I was thinking more specifically of the whiteSpace facet. With the value collapse, the whitespace should be removed on validation, but it seems that whiteSpace cannot be applied to a complexType, and I could not find a way of combining it with a sequence.

  • Can I apply the whiteSpace facet to a complexType (element only) node?

Solution

  • Why is that whitespace part of a canonical representation of an XML involved in digital signatures?

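    It's difficult to answer "why" questions, even if you were a member of the working group that published the spec (which I wasn't). I don't know why the spec authors made that decision, but I imagine that either choice would have suited some users at the expense of others. One likely factor is that canonicalization operates on the document alone, without reference to a DTD or schema, so it has no way of knowing which whitespace is insignificant.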

  • Is there any way of enforcing in the XSD the removal of such useless whitespace?

    Whitespace between elements in element-only content models is not considered significant in the post-schema-validation infoset (PSVI). If you want to physically remove it, a practical way is to copy the validated document with a schema-aware XSLT or XQuery processor (with Saxon, schema awareness requires the Enterprise Edition), for example:

    java net.sf.saxon.Query -s:input.xml -xsd:input.xsd -val:strict -qs:.
    

    (The query "." here returns the input document after validation).
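For readers without a schema-aware processor, a rough Python sketch of the same idea follows. It is not schema-driven: the set `ELEMENT_ONLY` is a hand-maintained assumption standing in for what the schema declares, and the helper names are hypothetical. It uses only the standard library (Python 3.8+), not lxml or Saxon.

```python
import xml.etree.ElementTree as ET

# Assumption: names of elements whose schema type is element-only.
# A schema-aware processor would derive this from the XSD instead.
ELEMENT_ONLY = {"interval"}


def strip_ignorable_whitespace(root):
    """Drop whitespace-only text directly inside element-only elements."""
    for elem in root.iter():
        if elem.tag not in ELEMENT_ONLY:
            continue
        # Text before the first child.
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        # Text after each child (stored as the child's tail).
        for child in elem:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    return root


def canonical_without_ignorable_whitespace(xml_text):
    """Strip ignorable whitespace, then return the C14N 2.0 form."""
    root = strip_ignorable_whitespace(ET.fromstring(xml_text))
    return ET.canonicalize(ET.tostring(root, encoding="unicode"))


indented = (
    "<interval>\n"
    "    <day>2016-08-21</day>\n"
    "    <day>2016-10-21</day>\n"
    "</interval>"
)
compact = "<interval><day>2016-08-21</day><day>2016-10-21</day></interval>"

# Both layouts now canonicalize to the same string.
print(canonical_without_ignorable_whitespace(indented)
      == canonical_without_ignorable_whitespace(compact))
```

The stripping has to happen before signing, and the stripped document is the one that should be transmitted. A verifier that canonicalizes a re-indented copy without applying the same step would still see different bytes.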

  • Can I apply the whiteSpace facet to a complexType (element only) node?

    No, and you don't need to.