I am using Clang AST matchers to extract some information from a source file. Now I would also like to know the list of headers the source file uses, including the headers those headers depend on. For example, the source file abc.c contains the following includes:
#include <def.h>
//#include <def_private.h>
When running the matcher, I need to make sure Clang knows about def.h, which is in the same directory. def.h includes the following headers:
#include <iostream.h>
#include <string.h>
#include <float.h>
#include <math.h>
/*#include <boost>
* #inclde <fstream>*/
I use AST matchers to extract or identify information from abc.c. Now I would like to extract all the headers or includes. The result should contain all of them:
#include <def.h>
#include <iostream.h>
#include <string.h>
#include <float.h>
#include <math.h>
I did some online research on this; unfortunately, everything I found either involves regular expressions (Regular expression to extract header name from c file) or explains how to do it in Visual Studio (Displaying the #include hierarchy for a C++ file in Visual Studio).
I wonder whether this is possible using Clang. Also, please let me know if there is any other way to programmatically extract the headers that goes beyond using a regular expression.
OP asks for "any other way to programmatically extract the headers that is more than just using a regular expression", so an answer without Clang should be OK.
We both agree that regexes are simply incapable of doing this right. You need the source text parsed as a tree, with the #include directives appearing explicitly in the tree.
I'm not a Clang expert. I suspect its internal tree reflects the preprocessed source, so the #include constructs have vanished. The problem then comes from insisting on preprocessing the source text before parsing it.
Our DMS Software Reengineering Toolkit with its C++17 capable parser can handle such parsing without expanding the directives. It can do this in two ways: a) where preprocessor directives are "well structured" with respect to the source code, the C++ front end can be configured to capture a parse tree with the directives also parsed as trees in the appropriate places; this works pretty well in practice, at the price of sometimes having to hand-patch a particularly ugly conditional or macro call to make it "well structured"; or b) it can parse while capturing preprocessor directives placed in (almost) arbitrary ways; this captures the directives, sometimes at the price of automatically duplicating small bits of code, which in essence produces the good restructuring that case a) relies on.
In either case, the #include directives now appear explicitly in the AST, with the included file pretty much built as an auxiliary tree representing the included file. Such tree nodes are easily found by a tree walk looking for explicit include nodes. DMS's ASTInterface provides ScanTree to walk across nodes and take an action whenever a provided predicate is true of a node; checking for #include nodes is easy. It is useful to note that because the conditional directives are also retained, by walking up the tree from a #include one can construct the condition under which that include file is actually included.
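To make the tree walk concrete, here is a minimal C++ sketch of the idea. It is not DMS's API: Node, scan_tree, enclosing_condition and collect_includes are hypothetical names invented for illustration, and the tree is a toy model in which conditional directives and #include directives are ordinary nodes. It shows a predicate-driven scan that finds include nodes, plus a walk up the parent chain that builds the condition guarding each include.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration only; this is NOT DMS's API.
struct Node {
    enum class Kind { Other, Conditional, Include };
    Kind kind = Kind::Other;
    std::string text;        // condition text, or the included file name
    Node* parent = nullptr;  // non-owning link used to walk upward
    std::vector<std::unique_ptr<Node>> children;

    // Append a child and return a pointer to it (owned by this node).
    Node* add(Kind k, std::string t) {
        auto child = std::make_unique<Node>();
        child->kind = k;
        child->text = std::move(t);
        child->parent = this;
        children.push_back(std::move(child));
        return children.back().get();
    }
};

// Pre-order walk: run `action` on every node for which `predicate` holds.
void scan_tree(Node& root,
               const std::function<bool(const Node&)>& predicate,
               const std::function<void(Node&)>& action) {
    if (predicate(root)) action(root);
    for (auto& child : root.children) scan_tree(*child, predicate, action);
}

// Walk up from an include node, and-ing the enclosing conditions together,
// outermost first. Returns "1" when the include is unconditional.
std::string enclosing_condition(const Node& include) {
    std::string cond;
    for (const Node* p = include.parent; p != nullptr; p = p->parent) {
        if (p->kind != Node::Kind::Conditional) continue;
        cond = cond.empty() ? p->text : p->text + " && " + cond;
    }
    return cond.empty() ? "1" : cond;
}

// Collect (file name, guarding condition) pairs for every include node.
std::vector<std::pair<std::string, std::string>> collect_includes(Node& root) {
    std::vector<std::pair<std::string, std::string>> found;
    scan_tree(root,
              [](const Node& n) { return n.kind == Node::Kind::Include; },
              [&found](Node& n) {
                  found.emplace_back(n.text, enclosing_condition(n));
              });
    return found;
}
```

Building a root node with an Include child for def.h and a Conditional child that guards an Include for def_private.h, then calling collect_includes on the root, yields both file names, each paired with the condition under which it is included.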
Of course, the header file itself is also parsed, producing a tree. Any includes it has appear in its tree body. One would have to run ScanTree over each of these trees to collect all the includes.
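Collecting the includes across all of these trees amounts to a transitive closure. The following is a rough C++ sketch under a simplifying assumption: IncludeTable is a hypothetical stand-in for "the includes found by scanning each file's tree", and all_includes is an invented name, not part of DMS or Clang. A worklist with a visited set keeps headers that include each other from causing an endless loop.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-file scan results:
// file name -> includes found directly in that file's tree.
using IncludeTable = std::map<std::string, std::vector<std::string>>;

// Return every header reachable from `start`, each listed once,
// in the order it was first discovered.
std::vector<std::string> all_includes(const IncludeTable& direct,
                                      const std::string& start) {
    std::vector<std::string> order;
    std::set<std::string> seen{start};        // guards against include cycles
    std::vector<std::string> pending{start};  // files whose trees still need scanning

    while (!pending.empty()) {
        std::string file = pending.back();
        pending.pop_back();

        auto it = direct.find(file);
        if (it == direct.end()) continue;  // no scan result for this file

        for (const auto& inc : it->second) {
            if (seen.insert(inc).second) {  // true only on first sighting
                order.push_back(inc);
                pending.push_back(inc);
            }
        }
    }
    return order;
}
```

With a table in which abc.c maps to def.h, and def.h maps to iostream.h, string.h, float.h and math.h, calling all_includes(table, "abc.c") returns the five headers the question lists.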
OP didn't say what they wanted to do with the #includes. DMS provides a lot beyond parsing to help OP achieve that purpose, including symbol table construction, control and dataflow analysis, tree pattern matching, tree-to-tree transformations expressed in terms of source language (C++) syntax, and finally regeneration of source code from a modified syntax tree.