I have a working XSD that rejects XML instances containing invalid whitespace (more details below): no carriage return (#xD), line feed (#xA), or tab (#x9) characters, no leading or trailing space (#x20) character, and no sequence of two or more adjacent space characters.
Sample XSD:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified"
           targetNamespace="http://www.example.com"
           xmlns:test="http://www.example.com">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <xs:element name="test-token" type="test:Tokenized500Type"/>
  <xs:simpleType name="Tokenized500Type">
    <xs:annotation>
      <xs:documentation>An element of this type has a minimum length of one character and a maximum
        of 500. It must not contain carriage return (#xD), line feed (#xA), or tab (#x9)
        characters, must not begin or end with a space (#x20) character, and must not contain
        a sequence of two or more adjacent space characters.</xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <xs:maxLength value="500"/>
      <xs:minLength value="1"/>
      <xs:pattern value="\S+( \S+)*"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
I tested this with literal whitespace characters as above.
What if the XML instance includes escaped whitespace in the relevant element content? Will this cause a validation error or not?
Here's an example instance with the escaped version:
<?xml version="1.0" encoding="UTF-8"?>
<test-token xmlns="http://www.example.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.example.com"> </test-token>
See also:
Meaning of xs:token for XSD processor: Will an instance with a xsd:token-type element containing whitespace pass validation?
XSD restriction to allow only xs:token whitespace: What is the regular expression for the set of strings that validate exactly the same for xsd:token and xsd:string?
The regex should operate on the expanded (unescaped) string, so there should be no distinction between a literal newline and its escaped form (a character reference).
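As a rough illustration of that point, the sketch below uses Python's standard-library XML parser (not an XSD validator) to show that a literal line feed and a character reference for it yield the same element text once parsed. The element name `t` and the sample content are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: one with a literal line feed,
# one with the same character written as a character reference.
literal_doc = "<t>a\nb</t>"
escaped_doc = "<t>a&#xA;b</t>"

# The parser expands character references before any consumer
# (including a schema validator) sees the element content.
literal_text = ET.fromstring(literal_doc).text
escaped_text = ET.fromstring(escaped_doc).text

print(repr(literal_text), repr(escaped_text), literal_text == escaped_text)
```

Since both forms reach the validator as the same string, a pattern facet that rejects the literal character should reject the escaped one too.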
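In XSD regular expressions, \S matches any character that is not whitespace: it is short for [^\s], where \s is [#x20\t\n\r] (space, tab, line feed, carriage return). The longer class [^\f\n\r\t\v\u00A0\u2028\u2029] that is sometimes quoted for \S comes from other regex dialects and does not apply to XSD.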
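To experiment with the pattern outside a validator, here is a minimal sketch in Python. It assumes the XML Schema definition of \s (space, tab, line feed, carriage return) and spells that class out explicitly, because Python's own \S excludes a wider set of characters. The function name `looks_like_token` is hypothetical, and the length check only approximates the minLength and maxLength facets.

```python
import re

# Assumption: XSD's \S is [^\s] with \s = space, tab, LF, CR.
# Python's \S differs, so the class is written out by hand.
XSD_NON_SPACE = r"[^ \t\n\r]"
TOKEN_PATTERN = re.compile(rf"{XSD_NON_SPACE}+( {XSD_NON_SPACE}+)*")


def looks_like_token(value: str) -> bool:
    """Approximate the Tokenized500Type facets for an already-parsed string."""
    # XSD patterns are implicitly anchored, hence fullmatch rather than search.
    return 1 <= len(value) <= 500 and TOKEN_PATTERN.fullmatch(value) is not None


for sample in ["a b", " a", "a ", "a  b", "a\tb", "a\nb", ""]:
    print(repr(sample), looks_like_token(sample))
```

This is only a way to probe the pattern's behaviour; a real validator remains the authority, since regex engines differ in details.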
Also note that the regular expressions used in XSD are Unicode Regular Expressions, which differ from the more standard POSIX regexes. To make matters worse, some parsers use whatever regex engine happens to be knocking around: XSD validation in .NET uses its internal regex parser, which is not a 'Unicode Regular Expression' implementation. The XML Schema specification itself says:
Note: The ·regular expression· language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [Unicode Regular Expression Guidelines]. It is hoped that future versions of this specification will provide support for "Level 2" features.