parsing grammar parser-generator lr-grammar

Repetitions Grammar parsing S -> ( E ';' )+

I know how to parse grammars like this:

E -> E '*' E
E -> E '+' E
E -> N
N -> '0'
N -> '1'

But what if I have the following example grammar (with a "regex repitition"):

E -> 'e'
S -> ( E ';' )+ // '+' used like in regexes

How can i parse such grammars with LR(k) (or some other parsing algorythm) and build a tree?

Solution

That's either

S → E ';'
S → S E ';'

S → E ';'
S → E ';' S

In your first snippet, which you say you "know how to parse", your grammar is ambiguous. I presume that you are parsing it with some kind of external precedence/associativity declarations, since the parse cannot be meaningfully applied to a text for purposes other than simple recognition without specifying how the parse tree is to be constructed.

The grammars I provide here are not ambiguous, and thus embody the associativity of the lists. In many cases, the associativity of a list is irrelevant, since the desired outcome is simply a list; in such as a case it doesn't matter (grammatically) which of the above alternatives you choose.

Typically when using an LR parser generator we'll choose the left-associative one, which is the first one above. That's because right-associativity requires keeping individual elements on the parser stack until the list can finally be constructed, back-to-front, when you get to the end. Thus, parsing a long list can use up a lot of parser stack; if your parser generator limits the size of the stack then that's eventually going to be a problem.

Back-to-front list construction can also be confusing for novices. A common confusion (judging from questions on SO) comes from the following "debugging" code (in yacc/bison syntax): (For simplicity, I implement (E ';')* instead of (E ';')+; most of the time, that's what you want anyway.)

list: %empty
    | element ';' list { printf("Adding %s\n", $1); }

This will result in the elements in the list being printed out right to left, which is fine if that's what you're expecting the code to do. But often it results in confusion, which is somewhat counter-productive for debugging code. (This is just one of the reasons why I always recommend using the debugging tools built into your parser generator - and always choosing a parser generator which has debugging tools built in -- rather than trying to create parser traces willy-nilly with a collection of ad hoc print statements.)

If the parser is, for example, part of an immediate calculator, back-to-front evaluation would clearly be a huge mistake. You want to evaluate and then discard the expressions one at a time, left-to-right, and left-associativity would be obligatory.

But that's not always the case. Suppose we're parsing for the purpose of constructing an AST (or some other intermediate product which will lead to code generation). And suppose the elements here are statements, while the list represents a block (minus the block delimiters, which will be attached in some outer production). In languages in which declarations in a block are local to the block and scoped from declaration to the end of the block, the semantics of the program strongly suggest right associativity. Consider the following slightly stupid example:

1    function example(i: integer)
2       var tmp: integer = i;
3       var i: integer = tmp - 1;
4       return tmp * i;
5    end

Here, the scope of tmp extends from statement 2 up to the end of statement 4. The scope of the i in the parameter list extends from statement 1 up to statement 5, but in statement 3 it is shadowed by the declaration of another variable with the same name, whose scope extends from statement 3 to the end of statement 4.

To parse this language meaningfully, we'll want to create a new subscope every time we see a declaration, and attach this subscope to the part of the program which starts just after the declaration. That would suggest a grammar something like this:

block: %empty
     | declaration ';' block { $3.set_scope(new Scope($1));
                               $3.prepend($1.initialiser()); }
     | statement ';' block   { $3.prepend($1); }

Just to make it clear: there are well-known idioms for converting

S → A B*

and

S → B* A

to context-free grammars. The first one is

S: A
 | S B

The second one is

S: A
 | B S

If A is B (in other words, if you want S → B+, which represents the same text as either S → B B* or S → B* B, you would use S: B | S B or S: B | B S. I didn't do that above because it involves repeating B and whatever action corresponds to it, which is annoying if B is not a single symbol. There's nothing wrong with repeating B (or creating an intermediate non-terminal to represent it, if it is really complicated or has a complicated action), but it was simpler to avoid the issue.