
Xerces-J xsd:base64Binary lexical validation question


I've recently upgraded my project from Xerces-J 2.7.0 to Xerces-J 2.12.1 and I'm seeing a change in schema validation behaviour. I'm not sure whether my test is wrong or Xerces is.

Given this schema:

<?xml version='1.0'?>
<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
  <!-- Schema to test facets for the xsd:base64Binary datatype. -->
  <xsd:element name="facetTest" type="FacetTestComplexType"/>
  <xsd:complexType name="FacetTestComplexType">
    <xsd:sequence>
      <xsd:element name='enumeration' type='EnumerationType' minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- ***** Enumeration ***** -->
  <xsd:simpleType name='EnumerationType'>
    <xsd:restriction base='xsd:base64Binary'>
      <xsd:enumeration value='Ab1+'/>
      <xsd:enumeration value='7 d Ec'/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>

And this instance document:

<facetTest>
  <enumeration>7dEc</enumeration>
</facetTest>

With Xerces-J 2.7.0 that instance document was reported as valid; with Xerces-J 2.12.1 it is now flagged as invalid.
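For anyone who wants to reproduce this outside a larger project, here is a minimal sketch using the standard JAXP validation API. The class and method names (`EnumerationRepro`, `isValid`) are made up for illustration, and the schema is the one above with the comments stripped. Which Xerces build actually runs depends on the classpath: with no Xerces jar present, JAXP falls back to the JDK's bundled parser, so the verdict for `7dEc` may differ from either version mentioned here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical harness: names are illustrative, not part of Xerces.
final class EnumerationRepro {

    // The schema from the question, without the comments.
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='facetTest' type='FacetTestComplexType'/>"
        + "<xsd:complexType name='FacetTestComplexType'>"
        + "<xsd:sequence>"
        + "<xsd:element name='enumeration' type='EnumerationType' minOccurs='0'/>"
        + "</xsd:sequence>"
        + "</xsd:complexType>"
        + "<xsd:simpleType name='EnumerationType'>"
        + "<xsd:restriction base='xsd:base64Binary'>"
        + "<xsd:enumeration value='Ab1+'/>"
        + "<xsd:enumeration value='7 d Ec'/>"
        + "</xsd:restriction>"
        + "</xsd:simpleType>"
        + "</xsd:schema>";

    // Returns true if <enumeration>text</enumeration> validates against XSD.
    static boolean isValid(String text) {
        Validator validator;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            validator = schema.newValidator();
        } catch (SAXException e) {
            // The schema itself was rejected; that is a different problem.
            throw new IllegalStateException("schema did not compile", e);
        }
        String instance =
            "<facetTest><enumeration>" + text + "</enumeration></facetTest>";
        try {
            // With no ErrorHandler set, validation errors surface as SAXException.
            validator.validate(new StreamSource(new StringReader(instance)));
            return true;
        } catch (SAXException e) {
            System.out.println("invalid: " + e.getMessage());
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Ab1+ valid? " + isValid("Ab1+"));
        System.out.println("7dEc valid? " + isValid("7dEc"));
    }
}
```

Running it once with each Xerces jar on the classpath should show whether the difference comes from the parser version alone.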

I reviewed the base64Binary section of the XML Schema specification, and it has left me unclear on whether this should be valid (my code is right and Xerces-J is wrong) or vice versa. This is the passage that has thrown me:

Note that this grammar requires the number of non-whitespace characters in the lexical form to be a multiple of four, and for equals signs to appear only at the end of the lexical form; strings which do not meet these constraints are not legal lexical forms of base64Binary because they cannot successfully be decoded by base64 decoders.

Note: The above definition of the lexical space is more restrictive than that given in [RFC 2045] as regards whitespace -- this is not an issue in practice. Any string compatible with the RFC can occur in an element or attribute validated by this type, because the ·whiteSpace· facet of this type is fixed to collapse, which means that all leading and trailing whitespace will be stripped, and all internal whitespace collapsed to single space characters, before the above grammar is enforced.

According to the definition of enumeration, it restricts the value-space, not the lexical-space. In that case it seems the value-space appears to cover the original binary content. If that's the case, then the whitespace should be meaningless.
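To make the value-space argument concrete, here is a small sketch that decodes both lexical forms to their octets and compares them. It uses `java.util.Base64` rather than anything from Xerces, and the class and method names (`Base64ValueSpace`, `toValue`, `sameValue`) are invented for the example. It assumes the value of a base64Binary lexical form is simply the decoded byte sequence once XML whitespace is removed.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper: illustrates the value-space comparison only.
final class Base64ValueSpace {

    // Maps a base64Binary lexical form to its value (the decoded octets).
    static byte[] toValue(String lexical) {
        // Drop XML whitespace (space, tab, CR, LF) before decoding.
        String compact = lexical.replaceAll("[ \\t\\r\\n]+", "");
        return Base64.getDecoder().decode(compact);
    }

    // Two lexical forms denote the same value if they decode to the same octets.
    static boolean sameValue(String a, String b) {
        return Arrays.equals(toValue(a), toValue(b));
    }

    public static void main(String[] args) {
        System.out.println(sameValue("7 d Ec", "7dEc")); // prints "true"
        System.out.println(sameValue("Ab1+", "7dEc"));   // prints "false"
    }
}
```

Both `7 d Ec` and `7dEc` decode to the three octets ED D1 1C, so if enumeration is compared in the value space, the instance value matches the second enumeration entry.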

Any clarification on whether my code or Xerces is incorrect would be greatly appreciated.


Solution

  • I think your code is correct, and Xerces has started to behave incorrectly.

    Although the base64 values in your enums look strange, they do conform to the grammar specified here: https://www.w3.org/TR/xmlschema-2/#base64Binary

    This is what the XSD spec says about enumeration facets:

    Validation Rule: enumeration valid: A value in a ·value space· is facet-valid with respect to ·enumeration· if the value is one of the values specified in {value}

    So I agree with your statement:

    According to the definition of enumeration, it restricts the value-space, not the lexical-space. In that case it seems the value-space appears to cover the original binary content.