A friend and I were discussing imaginary and real languages, and one question that came up was: if one of us wanted to generate headers for another language (perhaps D, which already has a tool), what would be an easy and very good way to do it?
One of us suggested scanning C files and headers, ignoring function bodies, and counting only the braces within them to figure out where each function ends. The counterargument was typedefs, defines (which can contain braces, though we considered defines a trivial problem), and templates plus specialization.
Another suggestion was to read the binaries that are produced: not the actual exe, but the object files the linker uses. The counterargument was the format and its complexity. Neither of us knew anything about any object format, so we couldn't estimate the effort (we were thinking of gcc and VS C++).
What do you guys think? Which is easier? Answers should be backed up with reasonable logic and facts.
If someone can link to a helpful project, that would be useful: either one that parses C files/headers and outputs the declarations, or an example project that reads ELF data and displays info. I tried googling, but I didn't know what it would be called. I found libelf, but at the moment I can't get it to compile. I might be able to soon.
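As a rough illustration of the brace-counting idea, here is a minimal C++ sketch. `strip_bodies` is a hypothetical helper, not taken from any existing tool. It assumes the input has no braces inside comments, string literals, or macros, and it treats every top-level brace block as a function body, so struct definitions, namespaces, and `extern "C"` blocks would be mangled by it.

```cpp
#include <string>

// Replace every top-level `{ ... }` block with ';' by tracking brace depth.
// Everything nested inside a block is dropped; text at depth 0 is kept.
std::string strip_bodies(const std::string& src) {
    std::string out;
    int depth = 0;
    for (char c : src) {
        if (c == '{') {
            if (depth == 0) {
                // Trim whitespace between the declarator and the body.
                while (!out.empty() &&
                       (out.back() == ' ' || out.back() == '\t' || out.back() == '\n')) {
                    out.pop_back();
                }
                out += ';';  // turn the definition into a declaration
            }
            ++depth;
        } else if (c == '}') {
            if (depth > 0) --depth;  // ignore unbalanced closing braces
        } else if (depth == 0) {
            out += c;
        }
    }
    return out;
}
```

Feeding it `void foo() {}` yields `void foo();`, but `struct S { int x; };` comes out as `struct S;;`, which is the kind of case the counterargument is about.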
You can use the clang libraries to parse C/C++ source code and extract any information you want, in particular function prototypes.
Thanks to clang's library-based architecture, it is easy to reuse just the parts you need. In your case these are the frontend libraries (liblex, libparse, libsema). I think this is a more feasible approach than a hand-written scanner, considering the difficulties you mentioned (typedefs, defines, etc.).
clang can also be used as a tool to parse the source code and output the AST in XML form. For example, if you have the file test.cpp:
void foo() {}
int main()
{
    foo();
}
and invoke clang++ -Xclang -ast-print-xml -fsyntax-only test.cpp, you'll get the file test.xml, similar to the following (irrelevant parts are skipped for brevity):
<?xml version="1.0"?>
<CLANG_XML>
<TranslationUnit>
<Function id="_1D" file="f2" line="1" col="6" context="_2"
name="foo" type="_12" function_type="_1E" num_args="0">
</Function>
<Function id="_1F" file="f2" line="3" col="5" context="_2"
name="main" type="_21" function_type="_22" num_args="0">
</Function>
</TranslationUnit>
<ReferenceSection>
<Types>
<FunctionType result_type="_12" id="_1E"/>
<FundamentalType kind="int" id="_21"/>
<FundamentalType kind="void" id="_12"/>
<FunctionType result_type="_21" id="_22"/>
<PointerType type="_12" id="_10"/>
</Types>
<Files>
<File id="f2" name="test.cpp"/>
</Files>
</ReferenceSection>
</CLANG_XML>
I don't think extracting this information from binaries is possible, at least for symbols with C linkage: because they are not name-mangled, the symbol records only the function's name, not its parameter or return types.
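To make the difference concrete, here is a small C++ sketch that asks the runtime to demangle a symbol name. It assumes a GCC or Clang toolchain, where `abi::__cxa_demangle` from `<cxxabi.h>` is available (it will not build with Visual C++, which uses a different mangling scheme). `demangle` is a hypothetical wrapper name.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol, or an empty string
// when the symbol is not a mangled C++ name (e.g. a plain C symbol).
std::string demangle(const char* symbol) {
    assert(symbol != nullptr);
    int status = 0;
    // With null buffer arguments the result is malloc'ed and must be freed.
    char* text = abi::__cxa_demangle(symbol, nullptr, nullptr, &status);
    std::string result = (status == 0 && text != nullptr) ? text : "";
    std::free(text);
    return result;
}
```

With this, `demangle("_Z3fooid")` gives back `foo(int, double)`, so the parameter types of a C++ function can be recovered from its symbol, while a C symbol such as `foo` does not demangle at all. Even in the C++ case the return type of an ordinary function is not part of the mangled name, so the object file alone is still not enough to rebuild a full prototype.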