Search code examples
c++parsingtemplatescompiler-construction

How to parse C++'s template syntax with the recursive-descent parser?


While writing a C++ parser for learning purposes, I had a question.

int a, b;
template <typename T> T Foo() {}
int main() {
    bool x = a < b;
    auto y = Foo<int>();
}

Here, < symbol in a < is less-than operator. But in Foo<, that means the start of template parameters. Both are essentially identifier, followed by the < symbol, in expression. Can the recursive-descent parser handle this situation? Or can only more powerful parsers like the LR parser handle this?


Solution

  • Recursive-descent parsers are more powerful than LR parsers, not less. A recursive-descent parser is usually written in a Turing-complete programming language (such as C++), while an LR parser only has the theoretical power of a pushdown automaton.

    Now, recursive-descent parsers also typically have a regular, top-down structure similar to LL parsers, and since LL is less powerful than LR, it's understandable to suspect recursive-descent is also less powerful. But, because it has access to all of the features of a conventional programming language, a recursive-descent parser can use very powerful lookahead techniques, limited only by decidability, including arbitrary backtracking (which is essentially using the parser as its own lookahead mechanism).

    In any case, how does a typical recursive-descent parser handle the C++ left-angle-bracket ambiguity? First we have to know what we want to accomplish, and for that, C++17 17.2 ([temp.names]) paragraph 3 says:

    When a name is considered to be a template-name, and it is followed by a <, the < is always taken as the delimiter of a template-argument-list and never as the less-than operator. [...]

    So, in an expression context, if we see:

      identifier <
    

    then we need to look up identifier to determine if it should be "considered to be a template-name" (the rules for which are in paragraph 2 of the same section, and have some subtleties I'll omit), which is principally determined by looking up the identifier in the symbol table. If it is found to refer to a template, then we treat the < as starting a template argument list, otherwise as the less-than operator.

    A concrete example of the technique can be found in the Clang parser. In Parser::ParseUnqualifiedId, we have:

      // unqualified-id:
      //   identifier
      //   template-id (when it hasn't already been annotated)
      if (Tok.is(tok::identifier)) {
      ParseIdentifier:
        // Consume the identifier.
        IdentifierInfo *Id = Tok.getIdentifierInfo();
        SourceLocation IdLoc = ConsumeToken();
    
        [...]
    
        // If the next token is a '<', we may have a template.
        TemplateTy Template;
        if (Tok.is(tok::less))
          return ParseUnqualifiedIdTemplateId(
              SS, ObjectType, ObjectHadErrors,
              TemplateKWLoc ? *TemplateKWLoc : SourceLocation(), Id, IdLoc,
              EnteringContext, Result, TemplateSpecified);
    

    and then ParseUnqualifiedIdTemplateId contains code to check if the identifier is a template:

          TNK = Actions.isTemplateName(getCurScope(), SS,
                                       TemplateKWLoc.isValid(), Id,
                                       ObjectType, EnteringContext, Template,
                                       MemberOfUnknownSpecialization);
    

    and, if TNK indicates a template was found, code that parses the template argument list:

      // Parse the enclosed template argument list.
      SourceLocation LAngleLoc, RAngleLoc;
      TemplateArgList TemplateArgs;
      if (ParseTemplateIdAfterTemplateName(true, LAngleLoc, TemplateArgs, RAngleLoc,
                                           Template))
        return true;
    

    Meanwhile, surrounding code I have not shown deals with the case where the name is not considered to be a template name. In that case, it sets Result to just the identifier, then returns false. That (perhaps unexpectedly) means "successful parse", and bubbles up to Parser::tryParseCXXIdExpression which processes the identifier alone, leaving the < as the next token in the input token stream, where it will eventually be treated as less-than.

    There are other contexts in which a similar decision has to be made, and other ambiguities that require more mechanism to handle (including the right-angle-bracket ambiguity, which is its own can of worms), but this code illustrates the basic idea: when faced with a difficult choice, check the possibilities one by one, taking care to isolate any parser state changes caused by speculative parsing that didn't work out.

    But, having explained the basic idea, I should caution that parsing C++ is extremely difficult, far more difficult than any other language I'm aware of. If you're an experienced language implementor eager for a "final boss" challenge that will take years if not decades to complete, go for it. But if you just want to learn the basics, I strongly recommend picking a different language.