
XML schema validation of literal versus escaped new line?


I have a working XSD that refuses to validate XML instances containing invalid whitespace (more details below). The content may not contain a carriage return (#xD), line feed (#xA), or tab (#x9) character, may not begin or end with a space (#x20) character, and may not contain a sequence of two or more adjacent space characters.

Sample XSD:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
elementFormDefault="qualified" 
targetNamespace="http://www.example.com"
xmlns:test="http://www.example.com">

<xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/xml.xsd"/>

<xs:element name="test-token" type="test:Tokenized500Type"></xs:element>

<xs:simpleType name="Tokenized500Type">
    <xs:annotation>
        <xs:documentation>An element of this type has minimum length of one character, a max of 500, and may not
            contain any of: the carriage return (#xD), line feed (#xA) nor tab (#x9) characters, shall
            not begin or end with a space (#x20) character, or a sequence of two or more adjacent space
            characters.</xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string">
        <xs:maxLength value="500"/>
        <xs:minLength value="1"/>
        <xs:pattern value="\S+( \S+)*"/>

    </xs:restriction>
</xs:simpleType>

</xs:schema>

I tested this with literal whitespace characters as above.

What if the XML instance includes escaped whitespace in the relevant element content? Will this cause a validation error or not?

Here's an example instance with the escaped version:

<?xml version="1.0" encoding="UTF-8"?>
<test-token xmlns="http://www.example.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.example.com">&#13;</test-token>



Solution

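  • The XML parser expands character references before the validator sees the element content, so the regex operates on the expanded (unescaped) string. A literal whitespace character and its escaped form, such as &#13;, are therefore treated alike: the pattern rejects both.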

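    \S matches any character that is not whitespace. In XSD regular expressions, \s is defined as [#x20\t\n\r], so \S is equivalent to [^ \t\n\r]. Other regex dialects use a broader set (JavaScript, for example, also counts \f, \v, \u00A0, \u2028 and \u2029 as whitespace).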
    

    Also note that the regular expressions used in XSD are the specification's own "Unicode Regular Expression" dialect, which differs from the more familiar POSIX regexes. To make matters worse, some validators use whatever regex engine happens to be available: XSD validation in .NET uses its internal regex engine, which is not a 'Unicode Regular Expression' implementation. The specification itself says:

    Note: The ·regular expression· language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [Unicode Regular Expression Guidelines]. It is hoped that future versions of this specification will provide support for "Level 2" features.
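
The point about expansion can be checked without a schema validator. The following is a minimal Python sketch, not a real XSD validation: it parses the instance with the standard library to show what text the parser hands on, then applies an approximation of the schema's pattern and length facets. The names TOKEN_PATTERN, element_text and matches_token_pattern are hypothetical helpers, and the pattern assumes that XSD's \s covers only space, tab, LF and CR.

```python
import re
import xml.etree.ElementTree as ET

# Approximation of the XSD pattern \S+( \S+)* in Python's re syntax.
# Assumption: XSD's \s is only space, tab, LF and CR, so \S is spelled
# out as [^ \t\n\r] instead of relying on Python's broader \S.
TOKEN_PATTERN = re.compile(r"[^ \t\n\r]+( [^ \t\n\r]+)*")

NS = "http://www.example.com"


def element_text(xml_source: str) -> str:
    """Parse the instance and return the root element's text content."""
    root = ET.fromstring(xml_source)
    return root.text or ""


def matches_token_pattern(value: str) -> bool:
    """Mimic the schema's minLength, maxLength and pattern facets."""
    return 1 <= len(value) <= 500 and TOKEN_PATTERN.fullmatch(value) is not None


# The instance from the question: the content is an escaped carriage return.
escaped = f'<test-token xmlns="{NS}">&#13;</test-token>'
# Hypothetical extra instances for comparison.
valid = f'<test-token xmlns="{NS}">one two</test-token>'
escaped_tab = f'<test-token xmlns="{NS}">one&#9;two</test-token>'

for source in (escaped, valid, escaped_tab):
    text = element_text(source)
    # By this point the character reference is already a real character.
    print(repr(text), matches_token_pattern(text))
```

Under these assumptions the escaped carriage return reaches the pattern as an actual #xD character and is rejected, just as a literal whitespace character would be. A schema-aware validator should be used to confirm the behaviour of a specific toolchain.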