Search code examples
cparsingcompiler-constructionprogramming-languages

Parsing a C family language out of order


C is parsed strictly in order, i.e. everything has to be declared before it is used; in particular, types must be declared before variables of those types. This makes sense because the grammar would be ambiguous if you didn't know what was the name of a type and what wasn't, e.g. a * b depends on whether a names a type.

On the other hand, some C family languages have the desirable property of relaxing this restriction (thus eliminating manual juggling of header files). I'm writing a parser for a C-superset language which is intended to likewise have that restriction relaxed, so now I need to figure out how to do it.

One method that occurs to me would be to do two passes. The first pass goes through everything, taking advantage of the fact that everything at the top level must be a declaration, not a statement, and picks up all the types. At this stage function bodies are left unexamined, just picked up as token streams delimited by matching braces. The second pass parses function bodies. Local declarations within a function would have to be in order, but that's not really a problem.

Are there any stumbling blocks in that method I haven't thought of?

How do compilers for C++, Java, C# etc. typically handle it for those parts of those languages that don't require declarations in order?


Solution

  • You don't have to do name resolution as you parse. First, if you are designing a "C-like" language (as opposed to a new C implementation), you can define your syntax so that declarations, expressions, methods, etc. are all unambiguous in the syntax. Then parsing order doesn't matter. (This would also fix the preprocessor disease, too, by integrating the preprocessor into the language in a structured way).

    If you insist on C-like syntax, you can use a parser that tolerates ambiguity, e.g., is happy to process "x*y;" and hold it as both an expression and declaration until it gets further data. In the extreme case, just think of this as a constraint-based resolution. C and C++ insisted on knowing definitions first because originally compiler memory space was pretty constrained and you couldn't just keep everything; that's no longer true. You don't have to insist on knowing the answer when you parse.

    We use GLR parsers for this in our DMS Software Reengineering Toolkit, and it is perfectly happy to parse C and C++11 just fine. We do name resolution after parsing; this isolates parsing and name resolution, making a much cleaner, easier to manage front end.