Tags: java, c++, sonarqube, static-analysis, coverity

When using a SAST tool, why do we have to use a "build wrapper" for compiled languages (e.g. C/C++)?


I am new to SAST tools. It's amazing to run these tools and find bugs that are sometimes obvious but that we just never noticed.

While I know how to run the tools, I still have many questions about how these incredible tools work under the hood.

For example, when using SonarQube or Coverity to scan C/C++ source code, we have to use a build wrapper so the tool can monitor the build process. However, for interpreted languages, these tools can just look at the code and still function very well.

I can envision that the tools build relationships within the source code (function calls, variables, memory allocation/deallocation), so what is the reason that, for a compiled language, the tool has to meddle in the build process?


Solution

  • A static analysis tool needs to know what the code means. For compiled languages, the meaning of the code often depends on how the compiler is invoked. For C/C++, that includes things like -D (macro definition) and -I (include path) options, as the former often controls the behavior of #ifdef and the latter is used to find headers for third-party libraries (among other things). For Java, the compilation command includes the -classpath option, which again is how third-party dependencies are found. Other compiled languages are similar.
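    As a minimal sketch of the C/C++ point (the macro name and function here are made up for illustration), the very shape of the program can depend on a -D flag; an analyzer that doesn't see the real compiler command line cannot even tell which branch of the code exists:

    ```c
    #include <stdio.h>

    /* Which definition of compute() exists depends entirely on whether
     * the build passes -DUSE_FAST_PATH. Without the real command line,
     * an analyzer cannot know which definition to analyze. */
    #ifdef USE_FAST_PATH
    int compute(int x) { return x << 1; }  /* only with -DUSE_FAST_PATH */
    #else
    int compute(int x) { return x * 2; }   /* only without it */
    #endif

    int main(void) {
        printf("%d\n", compute(21));
        return 0;
    }
    ```

    Both branches happen to agree here, but in real code they usually don't, and the analyzer must follow the branch the compiler actually took.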

    It is important to locate the correct dependencies, both because they can affect how the code should be parsed and because they determine the code's behavior. As an example of the former, consider that, in Java, the expression a.b.c.d.e.f could mean many things, since the . operator is used both to navigate in the package hierarchy and to dereference an object to access a field. If a comes from the classpath, the tool can't know what this means without inspecting the classes in that classpath. As an example of the latter, consider a function in a third-party library that accepts an object reference. Does that function allow a null reference to be passed? Unless it is a well-known function that the tool already knows about, the only way to tell is for the analyzer to inspect the bytecode of that function.

    Now, a tool could just ask the user to provide the compilation information directly when invoking the analyzer. That is the approach taken by clang-tidy, for example. This is conceptually simple, but it can be a challenge to maintain. In a large project, there may be many sets of files that are compiled with different options, making this a pain to set up. And possibly worse, there's no simple and general way to ensure the options passed to the analyzer and the set of files to analyze are kept in sync with the real build.
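    For illustration (file names and flags below are hypothetical), supplying the compilation information by hand looks roughly like this with clang-tidy, where everything after `--` is the compiler command line the analyzer should assume:

    ```shell
    # Pass the compile options for this file directly to the analyzer.
    clang-tidy src/parser.c -- -DUSE_FAST_PATH -Iinclude -Ithird_party/include

    # Alternatively, some build systems can emit a compilation database
    # (compile_commands.json) recording the exact command for each file,
    # which clang-tidy then reads with -p:
    cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -B build
    clang-tidy -p build src/parser.c
    ```

    The compilation database mitigates the maintenance problem somewhat, but it still has to be regenerated whenever the build changes, which is exactly the synchronization issue described above.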

    Consequently, some tools provide a "build monitor" that can wrap the usual build, inspecting all of the compilations it performs, and gathering both the set of source files to analyze and the options needed to compile them. When that is finished, the main analysis can begin. With this approach, nothing in the normal build has to be modified or maintained over time. This isn't entirely without potential issues, however. The tool may need to be told, for example, what the name of your compiler executable is (which can vary a lot in cross-compile scenarios), and you have to ensure the normal build performs a full build from a "clean" state, otherwise some files may be missed.
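    As a concrete sketch (the output directory names are arbitrary), this is roughly how the wrap-the-build approach looks in practice with the two tools mentioned in the question:

    ```shell
    # SonarQube: the build wrapper observes every compiler invocation
    # during a full, clean build and records it for the analyzer.
    build-wrapper-linux-x86-64 --out-dir bw-output make clean all
    sonar-scanner -Dsonar.cfamily.build-wrapper-output=bw-output

    # Coverity: cov-build does the same job, capturing the build into an
    # intermediate directory that cov-analyze then processes.
    cov-build --dir cov-int make clean all
    cov-analyze --dir cov-int
    ```

    Note the `make clean all` in both cases: the wrapper only sees compilations that actually run, so an incremental build would silently omit files.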

    Interpreted languages are usually different because they often have dependencies specified by environment variables that the analyzer can see. When that isn't the case, the analyzer will generally accept additional configuration options. For example, if the python executable on the PATH is not what will be used to run Python scripts being analyzed, the analyzer can typically be told to emulate a different one.
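    For example (using SonarQube's Python analysis as the illustration), such configuration is typically just a scanner property rather than a build wrapper:

    ```
    # sonar-project.properties — tell the analyzer which Python version
    # to emulate, instead of whatever "python" happens to be on PATH.
    sonar.python.version=3.11
    ```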

    Tangent: At the end of your question, you jokingly refer to this process as "meddling". In fact, these tools try very hard not to have any observable effect on the normal build. The paper A Few Billion Lines of Code Later (of which I am one of the authors) has some amusing anecdotes of failures to be transparent.