I am trying to build my first parser. Unfortunately I am not familiar with the theory of grammars and now I wonder whether it is
to have circular dependencies in my grammar. My intuition raises a yellow flag, but since I am not familiar with the theory of parsers, I am not sure.
Be my lexer well-defined and be its tokens the ones one would expect from their names, then I have the following grammar:
list_content : value
| list_content COMMA list_content
list : LBRACE list_content RBRACE
value : INT
| list
In there, value
depends on list
, list
depends on list_content
and list_content
depends on value
.
I have seen recursive definitions in grammars before, such as:
sum | NUMBER + NUMBER
| NUMBER + sum
| LBRACE sum RBRACE
However, I think, my circular definition is different (in terms of: dirtier), because it is harder to overview and the definitory circle spans multiple grammar rules. I am not sure, whether my circular definition creates an ambiguity in my grammar. I also fear it might make my code hard to debug.
So, I have two questions:
A) Should I restructure my grammar (and my lexer) or is it OK to live with this circular definition?
B) If I should restructure, how would I best do so?
A circular dependence like this is fine -- it's a recursive definition and is analogous to using recursion in a program. As such, the important thing to look at is how the base case is realized, since that is how the recusion terminates. If you don't have a base case (or it can't be reached without also triggering additional recursion), you have a problem -- an infinite loop that can never match any finite input.
In your case, the base case is the primitive
rule -- since value
can reduce to a single primitive
and list_content
to a single value
, everything is fine.
You do have one issue in your grammar in that the rule
list_content: list_content COMMA list_content
is ambiguous. What this means is that for any list with three or more elements (two or more COMMAs), there are multiple way to parse it -- either matching the left comma (left recursive) or the right comma (right recursive) first. This will cause problems with most parser tools that cannot deal with ambiguity, and in you case is probably irrelevant (you don't really care which way it is parsed, since you'll likely just concatenate the lists).
The fix for this is to rewrite the rule as a simple left- or right- recusive rule (but not both). Which you want to use depends on the parser style you are using -- for an LL (top down or recursive descent) parser, you want a right recursive rule. For an LR (bottom up or shift/reduce) parser, you (generally) want a left recursive rule.
list_content : value | list_content COMMA value ;
list_content : value | value COMMA list_content ;