Can I use regular expressions to define strings in ISO EBNF?


I'm using the standardized version of EBNF (ISO/IEC 14977:1996(E)) to define my grammar. The standardized notation is self-describing: its own syntax is defined in ISO EBNF.

I define a letter like this:

letter =  'A' | 'B' | 'C' | 'D' | 'E' | 'H' | 'I' | 'J' | 'K' | 'L' |
'O' | 'P' | 'Q' | 'R' | 'S' | 'V' | 'W' | 'X' | 'Y' | 'Z' | 'a' | 'b'
| 'c' | 'd' | 'e' | 'h' | 'i' | 'j' | 'k' | 'l' | 'o' | 'p' | 'q' |
'r' | 's' | 'v' | 'w' | 'x' | 'y' | 'z' | 'F' | 'G' | 'M' | 'N' | 'T' |
'U' | 'f' | 'g' | 'm' | 'n' | 't' | 'u';

I would prefer to write, more simply, letter = [a..z]|[A..Z];

My question is: would defining letter this way (using a regexp-style range) break EBNF's property of being self-defining?


Solution

  • Use a special sequence for this:

    A special-sequence consists of a special-sequence-symbol followed by a (possibly empty) sequence of special-sequence-characters followed by a special-sequence-symbol.

    The sequence of symbols represented by a special-sequence is outside the scope of this International Standard. Only the format of a special-sequence is defined in this International Standard. A special-sequence provides a notation for extensions which a user may require.

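    The W3C relies on the same kind of extension throughout its specifications, although it uses its own EBNF variant rather than ISO special sequences. For example, the XML specification describes its notation as follows: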

    The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
    
    symbol ::= expression
    
    Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.
    
    Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
    
    #xN
        where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.

    [a-zA-Z], [#xN-#xN]
        matches any Char with a value in the range(s) indicated (inclusive).

    [abc], [#xN#xN#xN]
        matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.

    [^a-z], [^#xN-#xN]
        matches any Char with a value outside the range indicated.

    [^abc], [^#xN#xN#xN]
        matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.

    "string"
        matches a literal string matching that given inside the double quotes.

    'string'
        matches a literal string matching that given inside the single quotes.
    
    These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:
    
    (expression)
        expression is treated as a unit and may be combined as described in this list.

    A?
        matches A or nothing; optional A.

    A B
        matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).

    A | B
        matches A or B.

    A - B
        matches any string that matches A but does not match B.

    A+
        matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).

    A*
        matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
    
    Other notations used in the productions are:
    
    /* ... */
        comment.

    [ wfc: ... ]
        well-formedness constraint; this identifies by name a constraint on well-formed documents associated with a production.

    [ vc: ... ]
        validity constraint; this identifies by name a constraint on valid documents associated with a production.
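
To tie this back to the question: since the standard leaves the content of a special sequence up to the user, a range written inside one is just shorthand that a tool can expand into the long alternation. Below is a minimal Python sketch of that idea. The `? X..Y ?` range convention and the function names are my own assumptions for illustration; ISO/IEC 14977 defines only the `? ... ?` delimiters, not what goes between them.

```python
import re

# Hypothetical convention (NOT defined by ISO/IEC 14977): a special sequence
# of the form  ? X..Y ?  stands for every character from X to Y inclusive.
RANGE_IN_SPECIAL_SEQUENCE = re.compile(r"\?\s*(\S)\.\.(\S)\s*\?")


def expand_range(match: re.Match) -> str:
    """Rewrite one '? X..Y ?' as an explicit alternation of quoted terminals."""
    first, last = match.group(1), match.group(2)
    if ord(first) > ord(last):
        raise ValueError(f"empty range: {first}..{last}")
    terminals = []
    for code in range(ord(first), ord(last) + 1):
        char = chr(code)
        # ISO EBNF allows either quote style; pick the one the character is not.
        quote = '"' if char == "'" else "'"
        terminals.append(f"{quote}{char}{quote}")
    return " | ".join(terminals)


def expand_special_ranges(grammar: str) -> str:
    """Expand every range-style special sequence found in the grammar text."""
    return RANGE_IN_SPECIAL_SEQUENCE.sub(expand_range, grammar)


# The rule from the question, written with the assumed range shorthand.
rule = "letter = ? a..z ? | ? A..Z ?;"
expanded = expand_special_ranges(rule)
print(expanded)
```

The expanded rule uses only plain terminals and `|`, so it is ordinary ISO EBNF again. In the W3C notation quoted above, the same set of characters would be written `[a-zA-Z]`.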
    
