W3C: How should the EBNF for SPARQL's IRIREF production be read?


(Specifications: https://www.w3.org/TR/sparql11-query/#rIRIREF)

According to the specification, an IRIREF can be parsed as this:

[139]   IRIREF    ::=   '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'

What is bothering me is this part of the expression:

\]-[

If I consider \ to be an escaping character in the bracketed character class (which would be the case in a Perl regular expression), then it means the \ alone is not a problem in the IRIREF and this is valid: <http://hello\world>

Then there is a bigger problem with the range ]-[. The character ] has an ordinal value of 93 and [ has 91, so this would be a reversed, invalid range: 93 to 91. Most regex engines I tested do not allow this.
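As a quick illustration of the reversed-range problem, here is a minimal sketch using Python's re module (only one engine, and not necessarily one of those tested above). The helper name compiles is hypothetical; it just reports whether a pattern is accepted.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Reading \] as an escaped ']' turns "]-[" into a range from 93 down to 91.
reversed_range_ok = compiles(r'[\]-[]')

# The same two endpoints in ascending order (91 up to 93) form a normal range.
forward_range_ok = compiles(r'[\[-\]]')
```

Under the Perl-style reading, the class is rejected at compile time, which is the symptom described above.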

What does that mean?

  1. Should I consider the - as a regular character in the bracketed character class, then this is invalid IRIREF: <http://new-example.org>. It makes no sense.
  2. Should I consider the range ]-[ null and this IRIREF is valid: <http://hello[world]>
  3. What I think is more likely is that the range is inverted and is not a problem for w3c specifications, which means the characters [, \ and ] are invalid characters. This makes sense.

Solution

  • The SPARQL spec says that its grammar is written using the notation defined by the XML 1.1 specification.

    In that notation, the right-hand side you quote,

    '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
    

    denotes a sequence of

    • a '<' character
    • zero or more characters matching the expression [^<>"{}|^`\]-[#x00-#x20]; this is a set difference denoting

      • any character matched by [^<>"{}|^`\] = any character other than '<', '>', '"', '{', '}', '|', '^', '`', or '\'; n.b. '\' is not an escape character in this notation (which has no escape characters at all)
      • except those matched by [#x00-#x20] = the C0 control characters (#x00-#x1F) plus the space character (#x20)

      This is a slightly odd way to write this pattern; it could equally well be written as the single negated class [^<>"{}|^`\#x00-#x20]; I'm not sure why the editors wrote it the way they did.

    • a '>' character
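To make the reading above concrete, here is a rough Python translation of the production. It is a sketch, not something taken from the SPARQL specification: the names IRIREF and is_iriref are made up for illustration, and the set difference is folded into one negated class. Because '\' is an escape character in Python's regex syntax, it has to be doubled to stand for a literal backslash.

```python
import re

# '<', then any number of characters that are none of < > " { } | ^ ` \
# and not in U+0000..U+0020, then '>'.
IRIREF = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')

def is_iriref(text: str) -> bool:
    """Return True if the whole string matches the IRIREF production."""
    return IRIREF.fullmatch(text) is not None

hyphen_ok = is_iriref('<http://new-example.org>')    # hyphen is allowed
brackets_ok = is_iriref('<http://hello[world]>')     # '[' and ']' are allowed
backslash_ok = is_iriref('<http://hello\\world>')    # contains a literal '\'
space_ok = is_iriref('<http://hello world>')         # contains U+0020
```

With this translation, the first two strings match and the last two do not.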

    So to answer your questions one by one:

    Should I consider the - as a regular character in the bracketed character class, then this is invalid IRIREF: <http://new-example.org>. It makes no sense.

    No. When A and B are expressions in this notation, A - B denotes any string in the language of A that is not also a string in the language of B. Here A and B are each character-class expressions, one negative and one positive.
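The A - B reading can also be spelled out with ordinary set operations. This is a small sketch that assumes an ASCII-only universe for brevity (the real grammar ranges over all characters); the variable names are illustrative.

```python
# Universe restricted to ASCII (an assumption made to keep the sets small).
universe = {chr(code) for code in range(0x80)}

# A: the negated class [^<>"{}|^`\], i.e. everything except these nine characters.
a = universe - set('<>"{}|^`\\')

# B: the positive class [#x00-#x20].
b = {chr(code) for code in range(0x00, 0x21)}

# A - B: members of A that are not also members of B.
allowed = a - b

hyphen_allowed = '-' in allowed
brackets_allowed = '[' in allowed and ']' in allowed
backslash_allowed = '\\' in allowed
```

Here the '-' in the grammar is the difference operator between the two classes, so the hyphen character itself remains in the allowed set.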

    You are right that it would make no sense to prohibit hyphens from a grammar rule intended to accept IRIs bracketed by angle brackets.

    Should I consider the range ]-[ null and this IRIREF is valid: <http://hello[world]>

    ']-[' does not denote a range here, null or otherwise; the ] ends the first character class expression and the [ begins the second.

    What I think is more likely is that the range is inverted and is not a problem for w3c specifications, which means the characters [, \ and ] are invalid characters. This makes sense.

    If my parsing of the expression is correct, '[' and ']' are legal (they are not excluded by the first expression, and they are not excluded by the second); '\' is excluded by the first expression.