Search code examples
parsinggrammaryacc

Circular Dependency in Parser Grammar


I am trying to build my first parser. Unfortunately I am not familiar with the theory of grammars and now I wonder whether it is

  • plainly forbidden
  • just a bad idea or
  • kind of OK

to have circular dependencies in my grammar. My intuition raises a yellow flag, but since I am not familiar with the theory of parsers, I am not sure.

Be my lexer well-defined and be its tokens the ones one would expect from their names, then I have the following grammar:

list_content    : value
                | list_content COMMA list_content

list            : LBRACE list_content RBRACE

value           : INT
                | list

In there, value depends on list, list depends on list_content and list_content depends on value.

I have seen recursive definitions in grammars before, such as:

sum             | NUMBER + NUMBER
                | NUMBER + sum
                | LBRACE sum RBRACE

However, I think, my circular definition is different (in terms of: dirtier), because it is harder to overview and the definitory circle spans multiple grammar rules. I am not sure, whether my circular definition creates an ambiguity in my grammar. I also fear it might make my code hard to debug.

So, I have two questions:

A) Should I restructure my grammar (and my lexer) or is it OK to live with this circular definition?

B) If I should restructure, how would I best do so?


Solution

  • A circular dependence like this is fine -- it's a recursive definition and is analogous to using recursion in a program. As such, the important thing to look at is how the base case is realized, since that is how the recusion terminates. If you don't have a base case (or it can't be reached without also triggering additional recursion), you have a problem -- an infinite loop that can never match any finite input.

    In your case, the base case is the primitive rule -- since value can reduce to a single primitive and list_content to a single value, everything is fine.

    You do have one issue in your grammar in that the rule

    list_content: list_content COMMA list_content
    

    is ambiguous. What this means is that for any list with three or more elements (two or more COMMAs), there are multiple way to parse it -- either matching the left comma (left recursive) or the right comma (right recursive) first. This will cause problems with most parser tools that cannot deal with ambiguity, and in you case is probably irrelevant (you don't really care which way it is parsed, since you'll likely just concatenate the lists).

    The fix for this is to rewrite the rule as a simple left- or right- recusive rule (but not both). Which you want to use depends on the parser style you are using -- for an LL (top down or recursive descent) parser, you want a right recursive rule. For an LR (bottom up or shift/reduce) parser, you (generally) want a left recursive rule.

    • left recursive: list_content : value | list_content COMMA value ;
    • right recursive: list_content : value | value COMMA list_content ;