So in my < 24 hours of bison/flex investigation I've seen lots of documentation that indicates that left recursion is better than right recursion. Some places even mention that with left recursion, you need constant space on the Bison parser stack whereas right recursion requires order N space. However, I can't quite find any sources that explains what is going on explicitly.
As an example (parser that only adds and subtracts):
Scanner:
%%
[0-9]+ {return NUMBER;}
%%
Parser:
%%
/* Left */
expression:
NUMBER
| expression '+' NUMBER { $$ = $1 + $3; }
| expression '-' NUMBER { $$ = $1 - $3; }
;
/* Right */
expression:
NUMBER
| NUMBER '+' expression { $$ = $1 + $3; }
| NUMBER '-' expression { $$ = $1 - $3; }
;
%%
For the example of 1+5-2, it seems with left recursion, the parser receives '1' from the lexer and sees that '1' matches expression: NUMBER
and pushes an expression of value 1 to the parser stack. It sees + and pushes. Then it sees 5 and the expression(1), + and 5 matches expression: expression '+' NUMBER
so it pops twice, does the math and pushes a new expression with value 6 on the stack and then repeats for the subtraction. At any one point, there is at max, 3 symbols on the stack. So it's like an in-place calculation and operates left to right.
With right recursion, I'm not sure why it has to load all symbols on the stack but I'm going to attempt at describing why that might be the case. It sees a 1 and matches expression: NUMBER
so it pushes an expression with value 1 on the stack. It pushes '+' on the stack. When it sees the 5, my first thought is that 5 on it's own could match expression: NUMBER
and hence be an expression of value 5 and then it plus the last two symbols on the stack could match expression: NUMBER '+' expression
but my assumption is that because expression
is on the right of the rule, that it can't jump the gun and evaluate 5 as an expression as a NUMBER since with LALR(1), it already knows more symbols are coming so it has to wait until it hits the end of the list?
TL;DR;
Can someone maybe explain with some detail how Bison manages its parse stack relative to how it does its shift/reduction with the parser grammar rules? Silly/contrived examples welcome!
With LR (bottom-up) parsing, each non-terminal is reduced precisely when its last token is encountered. (LALR parsing is a simplified LR parse which handles lookahead slightly less precisely.) Until a non-terminal is reduced, all its components live on the stack. So if you use right recursion and you are parsing
NUMBER + NUMBER + NUMBER + NUMBER
the reductions won't start until you get to the end, because each NUMBER
starts an expression
and all the expressions end at the last NUMBER
.
If you use left recursion, each NUMBER
terminates an expression
, so a reduction happens each time a NUMBER
is encountered.
That's not the reason to use left-recursion though. You use left-recursion because it accurately describes the language. If you have 7 - 2 - 1
, you want the result to be 4, because that's what algebraic rules require: the expression is parsed as though it were (7 - 2) - 1
, so 7 - 2
must be reduced first. With right-recursion, you would incorrectly evsluate that as 6, because the 2 - 1
would reduce first.
Most operators associate to the left, so you use left-recursion. For the occasional operator which associates to the right, you need right recursion and you have to live with the stack growing. It's not a big deal. Your machine has tons of memory.
As an example, consider assignment. a = b = 42
means a = (b = 42)
. If you did it left associatively, you'd first set a
to b
, and then attempt to set something to 42; (a = b) = 42
wouldn't make sense in most languages and it is certainly not the expected action.
LL (topdown) parsing uses lookahead to predict which production will be reduced. It can't handle left-recursion at all because the prediction ends up in a recursive loop: expression
starts with an expression
which starts with an expression
… and the parser never manages to predict a NUMBER
. So with LL parsers, you have to use right-recursion and then your grammar does not correctly describe the language (assuming that the language has left-associative operators, as is usually the case). Some people don't mind this, but I think that grammars should actually indicate the correct parse, and I find the modification necessary to make a grammar parseable with a top-down parser to be messy and hard to read. Your mileage may vary.
By the way, "force down your throat" is a very ungenerous description of documentation which is trying to give you good advice. It is good to be skeptical -- you understand things better if you work at figuring out why they work the way they do -- but many people just want the good advice.