Search code examples
ctypescode-generation

Algorithm to produce C type declarations from AST


I need to generate C code from an abstract syntax tree. Type declarations are tricky. Is there written down anywhere, an algorithm to do this?

There have been several previous questions here about going in the reverse direction. I'm not trying to parse C declarations, but to generate them.

I have tried examining the code for cdecl which is the only reasonably short program I know of that does this, but when I delete all the code that's clearly not the code for generating C declarations, I end up with an empty file, so I'm clearly missing something there.

The abstract syntax tree encodes C type semantics, where the kinds of nodes are:

  • Basic types (int, char etc)

  • Pointer to a type

  • Array of a type (with or without size)

  • Function taking a list of parameter types and returning a type

So the problem is just that of rendering this into C syntax.


Solution

  • If you’re able to implement the AST in C++, the STL provides std::type_info::name(), which returns the name of the type T. On some compilers, including MSVC, this returns the human-readable name. On others, including gcc, it returns a mangled name. GCC provides a abi::__cxa_demangle builtin to demangle it, and Boost provides a more portable function for this, boost::core::demangle.

    To achieve runtime polymorphism, you can use the typeid operator to get a std::type_info object at runtime, or you can have every type of AST node inherit from an abstract base class. That could have a pure virtual function ::name(), whose implementation might be something like

    #include <boost/core/demangle.hpp>
    #include <typeinfo>
    
    template<typename T>
      const char* AST<T>::name() {
        return boost::core::demangle( typeid(T).name() );
      }
    

    Or something using a bunch of #ifdef blocks to support different compilers.

    If you want to roll your own, these answers give some ways to concatenate constant strings at compile time using template metaprogramming. There would be several fiddly bits, such as: putting qualifiers such as const, volatile and signed into a canonical order, expressing a pointer to T(Args...) as T(*)(Args...), and the equivalent for pointers to arrays. If you wanted to eliminate all runtime overhead with template metaprogramming, you would probably define your own type_name<T> template, specialize it separately using std::enable_if and the <type_traits> library, and provide an override for compound types that concatenated the type names of the components with *, (, ), [ and ] in the proper way. If you can live with a small amount of overhead, you could just make them const std::string and concatenate them with +.

    You would want to do this from the bottom up, starting with names for the simple types, and then providing implementations for array-of-t, pointer-to-array-of-T, etc. that recursively look up the names of the building blocks. It’s very similar to pattern guards in a functional language.

    Doing it in C would be much, much harder.

    Edit

    Doing it in C would not necessarily be that much harder. (In our chat, you say you’re actually doing it in Kotlin.)

    You should consult the language standard (or a recent draft) for the formal grammar, but the subset of the language you’re interested in seems to have three cases:

    • Simple types, where you form a pointer by appending *,
    • Derived types, where you form a pointer by adding (*) in the middle, and
    • Pointers to derived types, where you already have something like (**) and you add another star to it. (Don’t ever actually write three-star code!)

    This can be written as an algorithm where types have a left, middle and right part. For simple types, left is the entire type, and the other two are empty. For arrays, left is the element type and right is the bounds. For function prototypes, left is the return type and right is the argument-list. Either way, middle is empty. For pointers to derived types, middle is the stars inside middle parentheses. So, for example, double(*)[4][4] has a left of double, a middle of * and a right of [4][4].

    Most operations have two special cases, depending on whether middle is empty or non-empty. (These could be implemented as specializations of generics, pattern matches, overloaded object methods, or a case block. Since we’re hypothetically doing this in C, we’d probably spell this strlen(middle) ? ... : ....)

    I’m going to spell string concatenation ⧺ from here on out.

    If a given type has an empty middle, its name in C is leftright, and a declaration is left ⧺ " " ⧺ nameright.

    If a given type has a non-empty middle, its name in C is left ⧺ "(" ⧺ middle ⧺ ")" ⧺ right and a declaration using it is left ⧺ "(" ⧺ middlename ⧺ ")" ⧺ right. For example, in the declaration int(*callback)(handle), left is int, middle is *, name is callback, and right is (handle).

    Forming pointers to a type has three cases: if there is a non-empty middle, append * to middle. If there is an empty middle and a non-empty right, set middle to *. If both middle and right are empty, append * to left.

    There are corner cases that need a bit more fiddling if you want to support const and volatile pointers, K&R style incomplete function types, or C++ references.