How do I test for special characters using Schematron tests?


I am trying to set up a Schematron test that detects special characters in XML.

More specifically, I would like to raise a warning whenever the copyright symbol (Unicode U+00A9) occurs.

It seems that Schematron files cannot be parsed when the rules use any of the following notations:

<iso:rule context="myelement">
   <iso:report test="matches(., '\u00A9')">{ES1037} Copyright Symbol Detected</iso:report>
</iso:rule> 

<iso:rule context="myelement">
   <iso:report test="matches(., '\u{00A9}')">{ES1037} Copyright Symbol Detected</iso:report>
</iso:rule> 

<iso:rule context="myelement">
   <iso:report test="matches(., '\u{A9}')">{ES1037} Copyright Symbol Detected</iso:report>
</iso:rule> 

<iso:rule context="myelement">
   <iso:report test="matches(., '\x{00A9}')">{ES1037} Copyright Symbol Detected</iso:report>
</iso:rule> 

Do any Schematron experts out there know how to embed a Unicode character in a regex?

Thanks in advance...


Solution

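  • Write the character as a numeric character reference, as you would in an XML Schema regular expression. XPath 2.0 regexes follow the XML Schema syntax, which has no \u or \x escapes; the XML parser resolves &#xa9; to the character itself before the expression is evaluated. Because matches() is an XPath 2.0 function, most processors also need queryBinding="xslt2" on the schema element, and the pattern id must not contain spaces:

    <?xml version="1.0" encoding="UTF-8"?>
    <iso:schema xmlns:iso="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
        <iso:pattern id="unicode-in-regex">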
            <iso:rule context="a">
                <iso:report test="matches(., '&#xa9;')">
                    Copyright found
                </iso:report>
            </iso:rule>
        </iso:pattern>
    </iso:schema>
    

    Output in XML ValidatorBuddy
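
    Applied to the rule from the question, the same idea might look like the sketch below. The pattern id copyright-check is made up for illustration, queryBinding="xslt2" assumes a processor that runs in XSLT 2.0 mode, and role="warning" is only a common convention: whether it is actually reported as a warning depends on the validator.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Assumption: the processor honours queryBinding="xslt2", which matches() needs. -->
    <iso:schema xmlns:iso="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
        <!-- Hypothetical pattern id, chosen for this example only. -->
        <iso:pattern id="copyright-check">
            <iso:rule context="myelement">
                <!-- The XML parser replaces the character reference with U+00A9
                     before the XPath engine evaluates the test. -->
                <!-- role="warning" is a convention; its effect is validator-specific. -->
                <iso:report test="matches(., '&#xa9;')" role="warning">{ES1037} Copyright Symbol Detected</iso:report>
            </iso:rule>
        </iso:pattern>
    </iso:schema>
    ```

    Since the test looks for a single literal character rather than a pattern, contains(., '&#xa9;') should work just as well and does not depend on XPath 2.0.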