Search code examples
syntaxindentationbnfebnfvertical-text

How to represent vertical alignment of syntax of code using BNF, EBNF or etc.?


How to say that (in BNF, EBNF, etc) any two or more letters are placed in the same vertical alignment

e.g In python 2.x, we have what we call indentation.

def hello():
    print "hello," 
    print "world"

hello()

Note letter p (second line) is placed in the same vertical alignment of letter p (third line)

Further example (in markdown):

MyHeader
========
topic
-----

Note M and the first = are placed in the same vertical alignment (also r and last =, t and first -, c and last -)

My question is How to represent these vertical alignment of letters using BNF, EBNF or etc.?

Further note: My point of this question is searching for a representation method to represent a vertical alignment of code, not just want to know how to write BNF or EBNF of Python or Markdown.


Solution

  • You can parse an indentation-sensitive language (like Python or Haskell) by using a little hack, which is well-described in the Python language reference's chapter on lexical analysis. As described, the lexical analyzer turns leading whitespace into INDENT and DEDENT tokens [Note 1], which are then used in the Python grammar in a straightforward fashion. Here's a small excerpt:

    suite         ::=  stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
    statement     ::=  stmt_list NEWLINE | compound_stmt
    stmt_list     ::=  simple_stmt (";" simple_stmt)* [";"]
    while_stmt    ::=  "while" expression ":" suite ["else" ":" suite]
    

    So if you are prepared to describe (or reference) the lexical analysis algorithm, the BNF is simple.

    However, you cannot actually write that algorithm as a context free grammar, because it is not context-free. (I'll leave out the proof, but it's similar to the proof that anbncn is not context free, which you can find in most elementary formal language textbooks, and all over the internet.)

    ISO standard EBNF (a free PDF is available) provides a way of including "extensions which a user may require": a Special-sequence, which is any text not containing a ? surrounded on both sides by a ?. So you could abuse the notation by including [Note 2]:

    DEDENT = ? See section 2.1.8 of https://docs.python.org/3.3/reference/ ? ;
    

    Or you could insert a full description of the algorithm. Of course, neither of those techniques will allow a parser generator to produce an accurate lexical analyzer, but it would be a reasonable way of communicating intent to a human reader.

    It's worth noting that EBNF itself uses a special sequence to define one of its productions:

    (* see 4.7 *) syntactic exception
       = ? a syntactic-factor that could be replaced
           by a syntactic-factor containing no
           meta-identifiers
         ? ;
    

    Notes

    1. The lexical analyzer also converts some physical newline characters into NEWLINE tokens, while making other newline characters vanish.

    2. EBNF normally uses the syntax = rather than ::= for a production, and insists that they be terminated with ;. Comments are enclosed between (* and *).