I'm trying to write a yacc (menhir) grammar for parsing a very reduced subset of C++ (no templates, headers only with no function bodies allowed...) and am already running into ambiguities.
typedef int B;
class A {
A(); // (*)
B(c)(); // (**)
};
Case * is a constructor and case ** is a parenthesized declarator. How does the parser tell the difference? I can imagine some ways I could tell, but I want to know how a compliant c++ parser does it. I also understand that I likely won’t be able to parse even a practical subset of C++ with yacc, but I just want to understand a bit better what’s going on. In the end maybe I’ll switch to a parser combinator or something. Also, I should note that I have zero interest in linking against clang as I intend to add some custom syntax.
Parsing C++ is frustrating exercise, because C++ is essentially not context-free. You need to know whether an identifier refers to a template, a type, or something else, and name resolution in C++ is not a simple task either. You might, for example, have to instantiate a template in order to know whether a member of templated class is a variable or a typename. (Programs only have to use typename
for names in dependent types.)
So you certainly cannot expect to parse a translation unit without considering included header files. (Anyway, you couldn't do that because a header file could define a macro whose expansion changes the syntax in unexpected ways. So you need a preprocessor.)
When I wrote this answer, my eyes had skipped over the parentheses in which OP noted that by "yacc" they meant "menhir". Sorry. Unfortunately, menhir does not generate GLR parsers, as far as I know. But I think the approach is still valid, and useful enough that I'm leaving it in place. Perhaps there is an OCaml GLR (or GLL) parser generator, or perhaps C/C++ generated code is still of use.
Having said all that, all is not lost. You could use bison to generate a GLR parser, and then try to write custom disambiguation functions. If the disambiguation function needs to know whether a name is a typedef or not, it can try to look it up in the symbol table. Unlike the classic "lexer hack" approach, GLR disambiguation happens in the parser, so it doesn't require building unwieldy backchannels.
This can be an iterative process, because you don't need to figure out how to resolve ambiguities which are not present in the particular code you are trying to compile. Eventually, you'll probably want to handle all of them, but doing them as needed might prove to be a more fulfilling experience.