Search code examples
clangllvmabstract-syntax-treeclang-ast-matchersclang-query

Writing AST Matcher expressions using an AST tree dump as a guide


I am new to clang-query and I need help to understand how to build ASTMatcher expressions for a code profiling tool I am trying to create. Eventually I would like to combine these AST matchers with a libTooling application to rewrite the source code.

Here is where I am havinve trouble. As a good starting point, I would like to break apart conditional compound statements into their individual sub-expressions, but I do not know how to do this and the AST matcher reference overwhelming for a newbie to clang.

Here is an example of a block of 'C' code I would like to analyze:

#include <stdbool.h>

// example code
int foo(const int* pa, int *pb) {
    if (pa[0] > 1) {
        pb[0] = 1;
    } else if ((pa[0] == 1) && (pa[1] == 1)) {
        pb[0] = 2;
    } else if ((pa[0] == 0) && (pa[1] == 1)) {
        pb[0] = 3;
    } else if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
        pb[0] = 4;
    }

    pb[0] = 2;
    switch (pa[0]) {
    case 0:
    case 1:
        break;
    case 2:
        break;

    default:
        break;
    }
    return 3;
}

int bar (int arg1, bool arg2) {
    return arg1 + (arg2 ? 1 : 0);
}

int main(int argc, const char **argv) {
    const int a[3] = { 0, 1, 2};
    int b[3] = {-1,-1,-1};
    foo(a, b);
    bar (3, true);
}

Using clang-query, I would like to identify all the compound conditional statements in the above code but I am having a tough time doing so. From what I understand, I should be able to use the AST dump to build AST matcher expressions.

As a very simple starting point, I can identify all the if statement conditions as follows:

clang-query> match ifStmt()

Match #1:

C:\Users\johnc\main\rewriter\.\input04.c:5:5: note: "root" binds here
    if (pa[0] > 1) {
    ^~~~~~~~~~~~~~~~

Match #2:

C:\Users\johnc\main\rewriter\.\input04.c:7:12: note: "root" binds here
    } else if ((pa[0] == 1) && (pa[1] == 1)) {
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Match #3:

C:\Users\johnc\main\rewriter\.\input04.c:9:12: note: "root" binds here
    } else if ((pa[0] == 0) && (pa[1] == 1)) {
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Match #4:

C:\Users\johnc\main\rewriter\.\input04.c:11:12: note: "root" binds here
    } else if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4 matches.
clang-query>

however I would also like to identify the individual sub expressions. I turned on the ast tree dump & reran the matcher as follows (I trimmed the output to just show the 4th if statement match):

This match encloses these sub expressions

((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2))

Whare are the steps I need to follow (presumably looking at AST nodes to identify:

((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2))
(pa[0] == 1)
(pa[1] == 1)

.

clang-query> enable output dump
clang-query> match ifStmt()

Match #1:
. . .

Match #2:
. . .

Match #3:
. . .
    
Match #4:

C:\Users\johnc\main\rewriter\.\input04.c:11:12: note: "root" binds here
    } else if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Binding for "root":
IfStmt 0x13e55687480 <C:\Users\johnc\main\rewriter\.\input04.c:11:12, line:13:5>
|-BinaryOperator 0x13e55687380 <line:11:16, col:59> 'int' '||'
| |-BinaryOperator 0x13e55687260 <col:16, col:43> 'int' '&&'
| | |-ParenExpr 0x13e55687140 <col:16, col:27> 'int'
| | | `-BinaryOperator 0x13e55687120 <col:17, col:26> 'int' '=='
| | |   |-ImplicitCastExpr 0x13e55687108 <col:17, col:21> 'int' <LValueToRValue>
| | |   | `-ArraySubscriptExpr 0x13e556870c0 <col:17, col:21> 'const int' lvalue
| | |   |   |-ImplicitCastExpr 0x13e556870a8 <col:17> 'const int *' <LValueToRValue>
| | |   |   | `-DeclRefExpr 0x13e55687060 <col:17> 'const int *' lvalue ParmVar 0x13e55545a30 'pa' 'const int *'
| | |   |   `-IntegerLiteral 0x13e55687080 <col:20> 'int' 0
| | |   `-IntegerLiteral 0x13e556870e0 <col:26> 'int' 1
| | `-ParenExpr 0x13e55687240 <col:32, col:43> 'int'
| |   `-BinaryOperator 0x13e55687220 <col:33, col:42> 'int' '=='
| |     |-ImplicitCastExpr 0x13e55687208 <col:33, col:37> 'int' <LValueToRValue>
| |     | `-ArraySubscriptExpr 0x13e556871c0 <col:33, col:37> 'const int' lvalue
| |     |   |-ImplicitCastExpr 0x13e556871a8 <col:33> 'const int *' <LValueToRValue>
| |     |   | `-DeclRefExpr 0x13e55687160 <col:33> 'const int *' lvalue ParmVar 0x13e55545a30 'pa' 'const int *'
| |     |   `-IntegerLiteral 0x13e55687180 <col:36> 'int' 1
| |     `-IntegerLiteral 0x13e556871e0 <col:42> 'int' 1
| `-ParenExpr 0x13e55687360 <col:48, col:59> 'int'
|   `-BinaryOperator 0x13e55687340 <col:49, col:58> 'int' '=='
|     |-ImplicitCastExpr 0x13e55687328 <col:49, col:53> 'int' <LValueToRValue>
|     | `-ArraySubscriptExpr 0x13e556872e0 <col:49, col:53> 'int' lvalue
|     |   |-ImplicitCastExpr 0x13e556872c8 <col:49> 'int *' <LValueToRValue>
|     |   | `-DeclRefExpr 0x13e55687280 <col:49> 'int *' lvalue ParmVar 0x13e55545ae0 'pb' 'int *'
|     |   `-IntegerLiteral 0x13e556872a0 <col:52> 'int' 0
|     `-IntegerLiteral 0x13e55687300 <col:58> 'int' 2
`-CompoundStmt 0x13e55687468 <col:62, line:13:5>
  `-BinaryOperator 0x13e55687448 <line:12:9, col:17> 'int' '='
    |-ArraySubscriptExpr 0x13e55687400 <col:9, col:13> 'int' lvalue
    | |-ImplicitCastExpr 0x13e556873e8 <col:9> 'int *' <LValueToRValue>
    | | `-DeclRefExpr 0x13e556873a0 <col:9> 'int *' lvalue ParmVar 0x13e55545ae0 'pb' 'int *'
    | `-IntegerLiteral 0x13e556873c0 <col:12> 'int' 0
    `-IntegerLiteral 0x13e55687420 <col:17> 'int' 4

4 matches.
clang-query>

Solution

  • The question title mentions the AST tree dump. The dump is mainly helpful for finding the relevant "node matcher". For example, in the AST dump, we see both IfStmt and BinaryOperator, so we know we may want to use the ifStmt and binaryOperator matchers.

    However, the AST dump does not tell us how to combine these, nor how to use the "traversal" and "narrowing" matchers. That comes from reading the examples in the AST Matcher Reference (that you also linked), plus trial and error (and asking questions on SO). It's not very easy or intuitive, unfortunately.

    The specific task you've described is, as I understand it, to report binary operator expressions that are subexpressions of the condition of an if statement. Here is a clang-query command to do that:

    clang-query \
      -c='m
        binaryOperator(            # We are looking for a binary operator expression
          hasType(booleanType()),  #   that has (C++) boolean type,
    
          anyOf(                   #   and that either
            hasParent(             #     has as its immediate parent
              ifStmt()             #       an 'if' statement,
            ),
    
            hasAncestor(           #     or has any ancestor
              expr(                #       that is an expression
                hasParent(         #         with immediate parent that is
                  ifStmt()         #           an 'if' statement.
                )
              )
            )
          )
        )
      ' \
      test.cc --
    

    (The comments in the above command are recognized and discarded by clang-query, not the shell.)

    When given the following input as test.cc:

    // test.cc
    // Report subexpressions of 'if' conditions.
    
    void g(int *pa, int *pb)
    {
      if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
        pb[0] = 4;
      }
    }
    
    // EOF
    

    it reports:

    Match #1:
    
    <path>/test.cc:6:7: note: "root" binds here
      if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    Match #2:
    
    <path>/test.cc:6:7: note: "root" binds here
      if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    Match #3:
    
    <path>/test.cc:6:8: note: "root" binds here
      if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
           ^~~~~~~~~~
    
    Match #4:
    
    <path>/test.cc:6:24: note: "root" binds here
      if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
                           ^~~~~~~~~~
    
    Match #5:
    
    <path>/test.cc:6:40: note: "root" binds here
      if ((pa[0] == 1) && (pa[1] == 1) || (pb[0] == 2)) {
                                           ^~~~~~~~~~
    5 matches.
    

    This isn't perfect, though, because it will also report subexpressions of the "then" or "else" clause if no braces are used to enclose them. For example, if given:

      if (pa)
        x = (pa[0] == 1)? 1 : 2;
    

    it will report:

        x = (pa[0] == 1)? 1 : 2;
             ^~~~~~~~~~
    

    I'm not sure if that can be fixed with the AST matcher language alone, nor whether it is a problem for your intended use.

    Additional notes regarding this query:

    • Using hasType(booleanType()) means this matcher only works with C++, not C, since there is no built-in boolean type in C. If working with C, the closest equivalent would be hasType(asString("int")).

    • The use of binaryOperator means this only matches binary operator expressions. If we want to match unary ! (NOT) expressions, use unaryOperator or just expr. (The query above specifically used binaryOperator since BinaryOperator appears in the AST dump and the question seemed focused on mapping the dump to a matcher.)

    • The binary ^ (XOR) operator yields int (or another promoted arithmetic type, depending on the operand types) since it is a bitwise rather than logical operator. Consequently the booleanType filter will cause it to not be reported directly. However, in an expression like !(c ^ d), there is an implicit conversion to bool associated with the parenthesis expression in the clang AST, so the parenthesis expression will be reported so long as IgnoreUnlessSpelledInSource is not also set. (The reason the query above uses a type filter is because the question appears interested in dissecting the logical components of the if condition, rather than every constituent subexpression, such as all of the identifier expressions.)