Search code examples
cparsingsyntaxpeg

How to handle ambiguity in syntax (like in C) in a Parsing Expression Grammar (like PEG.js)


So from my limited understanding, C has syntax ambiguity as seen in the expression:

T(*b)[4];

Here it is said about this sort of thing:

The well-known "typedef problem" with parsing C is that the standard C grammar is ambiguous unless the lexer distinguishes identifiers bound by typedef and other identifiers as two separate lexical classes. This means that the parser needs to feed scope information to the lexer during parsing. One upshot is that lexing must be done concurrently with parsing.

The problem is it can be interpreted as either multiplication or as a pointer depending on context (I don't 100% understand the details of this since I'm not expert in C, but I get the gist of it and why it's a problem).

typedef a;
b * a;                    // multiplication
a * b;                    // b is pointer to type a 

What I'm wondering is if you were to parse C with a Parsing Expression Grammar (PEG) such as this C grammar, how does it handle this ambiguity? I assume this grammar is not 100% correct because of this problem, and so am wondering how you would go about fixing it. What does it need to keep track of or do differently to account for this?


Solution

  • The usual way this is handled in a PEG grammar is to use a semantic predicate on a rule such that the rule only matches when the predicate is true, and have the predicate check whether the name in question is a type in the current context or not. In the link you give, there's a rule

    typedefName : Identifier
    

    which is the (only) one that needs the semantic predicate to resolve this ambiguity. The predicate simply checks the Identifier in question against the definitions in the current scope. If it is not defined as a type, then it rejects this rule, so the next lower priority one will (try to) match.