Search code examples
c++parsingcompiler-construction

Where does context sensitivity get resolved in the C++ compilation process?


Yesterday I asked about C++ context sensitivity, see here. Among many excellent answers, here is the accepted one, by dmckee.

However, I still think there's something to be said about this (maybe some terminological confusion?). The question amounts to: what part of compilation deals with the ambiguity?

To clarify my terminology: A CFG is a grammar that has only one non-terminal symbol on the left-hand-side of the rule (eg. A->zC), a CSG is one that has a terminal (plus a non-terminal) on the left-hand-side (aAv->QT), where uppercase letters are nonterminals and lowercase are terminals.

Is any representation like the latter in the grammar parsing C++ source code?

Thank you, and sorry to push the issue.


Solution

  • No C++ front end (parser, name/type resolver) that I know of (including the one we built) implements a context sensitive parser using CSG grammar rules as you defined it. Pretty much they operate explicitly or implicitly with a context free grammar which still has ambiguities.

    Many of them use a combination of top-down parsing and interwoven collection of type information to distinguish between the ambiguous cases.

    Weaving the parsing and type collection together makes building a parser this way really messy and hard, producing the folk theorem "C++ is hard to parse".

    Like many such "theorems", it isn't true, unless you also insist on parsing with one hand tied behind your back (e.g., recursive descent, LL(k), LALR [e.g., YACC]). If you use a parsing technology that can handle ambiguities, the C++ grammar is actually not so hard. We (and others, such as the Elsa C++ parser) use GLR parsing technology for this reason. (We go a bit further and capture macro definitions, uses and preprocessor conditionals in the C++ grammer because we are interested in transforming the code before the processor has ruined it; normally preprocessor directives are treated completely separately in an independent preprocessor).

    Elsa I think still interleaves the ambiguity resolution into the parsing process, but because the parsing is so clean this is easier to do. Our front end builds an AST with ambiguity nodes, and after the tree is built, we walk the tree using an attribute grammar evaluator to collect names, and types, and eliminate those branches of ambiguities that are type inconsistent. The latter is a really beautiful scheme, and completely decouples parsing and name resolution.

    What is hard is actually doing the name resolution. C++ has a pretty arcane scheme for looking things up, and its spread across the 600 pages of the standard, as well as bent in various ways by various dialects. Keeping name resolution separate from parsing makes this ugliness more manageable, but the folk theorem should be "C++ is hard to name resolve".