Search code examples
c++parsingheaderprototype-programming

Parsing irregular c++ prototypes


I am trying to build a program that parses and lists the content of header files. So far, so good, I found it easy parsing and listing headers I wrote, but when I started parsing cross platform API headers things got messy.

My current approach is rather simplistic, here is a pseudocode example of parsing the following function:

void foo(int a);

void is a type, so we are dealing with instancing a type
foo is the name of that type
foo is followed by brackets, meaning it is a function of type void named foo
   int is a type...
   a is the name of that type instance
foo is a function of type void that takes one parameter of type int named a

However, when I got into bigger and more complex headers I stumbled upon somewhat irregular prototypes, involving macros and god knows what. An example:

GLAPI void APIENTRY glEvalCoord1d( GLdouble u );

GLAPI and APIENTRY are platform dependent macros. Which kind of spoils my simple parsing scheme, since it expects the name of an object to follow its type. Those two macros happen to translate to either __stdcall, __declspec(dllimport) or extern but in theory they could mean anything, with their meaning being unclear until compile time.

How to write my parser so it can deal with such scenarios and not get confused? The macros themselves are defined at an earlier stage, so the parser can be aware GLAPI and APIENTRY are macros so they can simply be ignored, is this the way to go? Naturally this is just one of the many variations of irregularities the parser may stumble upon parsing through different headers, so any general techniques of how to deal with the parsing of any "legal" header content are welcome.


Solution

  • There isn't any real alternative to expanding the macros before you parse, at least if you want process header files with the same complexity as Microsoft's, or any other header files associated with a compiler system that has been around for 10 years or more.

    The unpreprocessed source code is NOT C; it is simply unpreprocessed source code. The macros (and prepreprocessor conditionals which you surprising didn't mention) can edit the apparant source in not arbitrary but spectacularly complex fashion. And you can't often know what the macros used, or conditionals expanded, unless you process the #includes as well.

    You can get GCC to do preprocessor expansion for you, and then parse it. That would be far the easiest way to approach this.

    That still leaves the problem of parsing real C code, with all the complexities of declarators, and ambiguities in fragments suchas T X; where the meaning of the statement depends on the declaration of T. To parse the headers accurately, you need a full C parser.

    Our C Front End can do full preprocessing, or you can invoke it a mode in which some macros are expanded, and some are not. By tuning this set, you often parse such headers without exapanding every macro. Preprocessor conditionals are much more difficult, because they can occur at inconvenient (unstructured) places.