Search code examples
clangllvmcode-generationabstract-syntax-tree

How do I get started with traversing Clang's AST?


I am playing around with LLVM's frontend and would appreciate some help with general processing technqiues of the AST. In the CodeGen directory, I have a method that processes an instance of the CodeGenFunction class. I know that this CGF contains information about a call to a function foo() in my source file, so as far as I understand, this call would correspond to a node in the AST.

Inside that CodeGen method I would like to identify that node and process the information contained in it. When I dump the entire AST for my source file, I can identify the node I am interested in, but I need a more methodical way of traversing the AST.

In my CodeGen method I defined an Expr variable, and after dumping it to stderr I can see something like:

MemberExpr /*address*/ '<bound member function type>' -> foo /*address*/
`-ImplicitCastExpr /*address*/ 'class Base *' <UncheckedDerivedToBase> (Base)>
  `-ImplicitCastExpr /*address*/ 'Derived1 *' <LValueToRValue>
    `-DeclRefExpr /*address*/ 'Derived1 *' lvalue ParmVar /*address*/ 'obj1' 'Derived1 *'

I would like to access and process the line starting with -DeclRefExpr, but I am not sure how to do that. All help is very appreacited!

Also, I am still new to this, so if I am not thinking about it right please let me know. Any explanation is welcome!


Solution

  • Learning about the AST

    As a prerequisite to traversing the AST, one should start with an understanding of the structure itself.

    The Clang documentation has two overview articles:

    After reading the above, what I recommend is to print the AST for small examples by feeding them to clang -fsyntax-only -Xclang -ast-dump filename.cc, which produces output similar to the dump you showed for the MemberExpr. Then, each of the AST node classes has its own API documentation page, for example:

    What I find works the best is to go to https://clang.llvm.org/doxygen/namespaceclang.html and then do a text search for a name of interest.

    However, the API documentation only covers the public methods, whereas what you really want to understand is the private data. For that, you need to open or click through to the source code ("More", then the line number of the definition file) and look at the private data declarations. The private data is the "ground truth" of what the AST node contains, and often has useful commentary, but unfortunately neither the private declarations nor their comments get copied into the Doxygen output.

    Traversing the AST directly

    We can now apply this knowledge. The question asks how to navigate from MemberExpr to the contained DeclRefExpr. Based on the AST dump you showed, the original source code was something like:

    // Preliminary declarations.
    struct Base {
      void foo();
    };
    struct Derived1 : Base {};
    Derived1 *obj1;
    
    void f()
    {
      obj1->foo();
    //^^^^^^^^^ Expression whose AST we are examining.
    }
    

    By following the documentation and source code linked above, we find the following access path:

    • MemberExpr has private member Stmt *Base, which is the obj1 in obj1->foo. That private member is retrieved by calling getBase(), yielding an Expr*. (Note: Expr is derived from Stmt.)

    • We can downcast the Expr* to ImplicitCastExpr* using dyn_cast.

    • ImplicitCastExpr is derived from CastExpr, which has private data member Stmt *Op which is not documented but from context and its name must be the operand. That member is retrieved by calling getSubExpr(), yielding Expr* again.

    • Repeat for the nested ImplicitCastExpr.

    • DeclRefExpr has private member ValueDecl *D, which is the declaration of obj1, and can be retrieved using getDecl().

    Putting these steps into code, it would be something like (not tested):

    using namespace clang;
    MemberExpr *memberExpr = ...;
    ImplicitCastExpr *outerICE = dyn_cast<ImplicitCastExpr>(memberExpr->getBase());
    ImplicitCastExpr *innerICE = dyn_cast<ImplicitCastExpr>(outerICE->getSubExpr());
    DeclRefExpr *dre = dyn_cast<DeclRefExpr>(innerICE->getSubExpr());
    ValueDecl *objValueDecl = dre->getDecl();
    

    Of course, in a real analysis you should not blindly assume you know the structure of the AST in advance. Instead, at each step, you need to check what the type of the nested nodes are and treat them accordingly.

    There are three main ways to check the type of a node:

    • dyn_cast returns a nullable pointer, so a common idiom is if (auto derived = dyn_cast<DerivedType>(base)) {...}, which enters the block only when base is DerivedType.

    • isa, another template in the dyn_cast family, returns a boolean (which is basically the same as checking if dyn_cast returns non-null).

    • All of the AST root superclasses (of which the three most important are Stmt, Decl, and Type) have a getKind() or similar method that returns an enumeration. Thus, another common idiom is to use switch (base->getKind()) {...} and have one case per AST node kind. Be aware that the enumerators correspond to the most-derived type, so, for example, you would need multiple case statements to get all of the types derived from CastExpr, whereas dyn_cast and isa work fine with such "intermediate" classes. Under the hood, dyn_cast calls getKind().

    Finally, note that you cannot in general use ordinary C++ dynamic_cast. The Clang source code is normally compiled with RTTI disabled, and even if you change that, some of the AST classes do not have vtables.

    Traversing the AST using RecursiveASTVisitor

    Depending on what processing you are doing, it is sometimes convenient to use RecursiveASTVisitor to automatically traverse the tree and dispatch on the node types. It's a mixed bag, since it provides a lot of convenient automation at first, but can be quite challenging to work with if your analysis does not fit perfectly into its traversal system.

    To use this method, you initiate a traversal from the MemberExpr, and then at some point during that, VisitDeclRefExpr will be called. See How to write RecursiveASTVisitor based ASTFrontendActions for details. If that suffices, great! Otherwise, and especially if you are intending to do code generation (which often needs to do things in a particular order and with varying contexts), it may be better to stick with direct traversal for maximum control.

    Traversing the AST using match expressions

    Another method of AST inspection is to write AST match expressions, which let one write a pattern that is then compared to all of the input, with matching fragments passed to a handler. The article Tutorial for building tools using LibTooling and LibASTMatchers discusses this method in some detail.

    That method does not seem like a good fit for a code generator, but is good for things like searching for problematic code patterns.

    If you try the match expression route, it is convenient to test your match expressions using the clang-query tool, which is a useful general AST search/grep tool in its own right.