Tags: regex, xml, xslt, schematron

Schematron regexp:test fails with broad expression


I am going nuts here..

The test fails with a broad match-anything regex such as '^.+$' (shown in the sample file below), but works with a more specific one such as '^C.+$'.

I also tried test="string-length(.) > 0" and it fails too.

Please help.

This is the XML file:

<article>
<back><ack>
<title>Acknowledgements</title>
<p>The authors would like to thank <funding-source rid="sp1">CNPq</funding-source> (Process numbers <award-id rid="sp1">303287/2013-6</award-id> and <award-id rid="sp1">303559/2012-8</award-id>), <funding-source rid="sp2">FAPESP</funding-source> (Process number <award-id rid="sp2">2012/12207-6</award-id>) and <funding-source rid="sp3">CAPES</funding-source> (Process number <award-id rid="sp3">12357-13-8</award-id>) for the financial support.</p>
</ack></back></article>

This is the schematron file that FAILS:

<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        queryBinding="exslt"
        xml:lang="en">
  <ns uri="http://www.w3.org/1999/xlink" prefix="xlink"/>
  <ns uri="http://exslt.org/regular-expressions" prefix="regexp"/>
  <pattern id="funding_info">
    <title>Make sure funding-source does not happen inside p</title>

      <assert test="regexp:test(current(), '^.+$')">
          EC-CHECK: There must be no 'funding-source' or 'award-id' inside 'p'
      </assert>
    </rule>
  </pattern>
</schema>

This is the schematron file that WORKS:

<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        queryBinding="exslt"
        xml:lang="en">
  <ns uri="http://www.w3.org/1999/xlink" prefix="xlink"/>
  <ns uri="http://exslt.org/regular-expressions" prefix="regexp"/>
  <pattern id="funding_info">
    <title>Make sure funding-source does not happen inside p</title>

      <assert test="regexp:test(current(), '^C.+$')">
          EC-CHECK: There must be no 'funding-source' or 'award-id' inside 'p'
      </assert>
    </rule>
  </pattern>
</schema>

Solution

  • It seems that backslashes in the XSL are treated as the start of an escape sequence. To use a regex shorthand character class such as \s, the backslash itself has to reach the regex engine as a literal character, so you need to write a double backslash:

    ^[\\s\\S]+$
    

    This pattern will match:

    • ^ - start of string
    • [\\s\\S]+ - one or more characters that are either whitespace or non-whitespace (thus, this matches any character)
    • $ - end of string.

    This also means that the regex flavor is not JavaScript, although this reference claims EXSLT uses the JavaScript flavor.
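
As a rough sketch of how the pattern above might be dropped into the schema from the question: the fragment below reuses the question's `regexp` namespace and `regexp:test(current(), ...)` call, with only the regex changed. The rule's `context="funding-source"` and the assertion message are assumptions for illustration, since the question does not show the rule context. Whether the doubled backslashes are required may also depend on the processor in use.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
        queryBinding="exslt"
        xml:lang="en">
  <ns uri="http://exslt.org/regular-expressions" prefix="regexp"/>
  <pattern id="funding_info">
    <!-- Assumed context: the original rule context is not shown in the question -->
    <rule context="funding-source">
      <!-- [\\s\\S] replaces '.', so the class matches any character at all -->
      <assert test="regexp:test(current(), '^[\\s\\S]+$')">
        EC-CHECK: 'funding-source' must contain at least one character
      </assert>
    </rule>
  </pattern>
</schema>
```

If this assertion passes where '^.+$' failed, that would support the escaping explanation above; if it still fails, the rule context is the next thing to check.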