
How can I maintain a language bootstrap?


I'm working on a compiler which will be bootstrapped, i.e. the compiler will compile itself from source. I'm implementing a stripped-down C++ version of the compiler as a stepping stone. My concern is that, in time, the real compiler will support features that the C++ implementation does not, and the compiler source code will leverage those features. As soon as that happens, the C++ implementation will no longer be usable, and the ability to start from nothing is lost.

The git history can be reused to start from a previous state and work towards the end result. Another option is to ensure the C++-based compiler is always able to compile the "real" compiler, either by extending the C++ version as needed, or never allowing the use of unsupported features in the real compiler's source. These both have their own disadvantages. I'd like to know if there are other techniques that have worked well for other languages.

What's a good approach for maintaining the bootstrap capability of a programming language?

For the curious, check out http://plange.tech


Solution

  • It depends on the target language of your compiler.

    As general advice, keep some translated variant of your compiler (that is, the output of the compiler compiling itself). Read about Tombstone diagrams and partial evaluation (notably Futamura projections).

    OCaml has a bytecode variant (running on a quite stable bytecode virtual machine) and keeps the (portable) bytecode of the ocamlc compiler in its git or svn repository (see boot/ocamlc in the official repository).

    In my GCC MELT (a Lisp-like language translated to C++ suitable as a GCC plugin; sadly I don't have time to maintain it), I kept the translated and generated C++ code in the generated/ sub-directory of the svn repository.

    Many compilers generating C code (e.g. Bigloo, Chicken, ...) keep the generated C code, even in the version control repository. C is so universal (and can be viewed as a nearly portable assembler) that many experimental language implementations use it as their target language. You might also maintain several targets (one of them being C). Or you could also have a naive interpreter (to be able to run your bootstrapping compiler).

    Apparently (according to your comments), you chose LLVM as your target language. Then you depend on the stability of the LLVM language (specification). You had better keep your translated form as LLVM assembly language (in textual form), e.g. in your version control repository (since that translated form is as precious as your source code).

    You'll then be in trouble if the LLVM assembly language evolves incompatibly. If that happens, you could transform the textual form (and that is why a textual form is preferable: it is easier to transform) to the newer LLVM syntax (assuming it has not changed a lot), or to some similar language like low-level C or GIMPLE, or use a previous LLVM code generator for a while. BTW, C is successful as a target language for similar reasons: it has evolved mostly compatibly, and is widely used.

    Go, D and Rust keep a binary executable of the compiler (e.g. ELF) for common systems and architectures (Linux/x86-64). IIRC some of them download (from a stable URL) the previous binary of the compiler during installation.

    The Bones Scheme compiler generates x86-64 assembler directly, so it distributes its generated assembler code along with its source code.

    Some compilers implementing (a superset of) some standard language (e.g. SBCL, for Common Lisp) take care to be compilable by several other implementations of that language (so SBCL can be bootstrapped on CLISP). This is why the core SBCL compiler is coded in strict and portable Common Lisp.

    Your incremental bootstrapping idea is a well-known approach. J. Pitrat's blog contains several entries on that idea.

    Practically speaking, you should "cold-bootstrap" your implementation quite often (i.e. starting from the translated variant in your version control repository), at least weekly. Bootstrap failures are painful bugs. From time to time (certainly at each release, probably more often) you'll copy the newly translated form into your version control repository. Be sure to run extensive tests after that.

    That "cold-bootstrap" process will probably use several hours of CPU time, but you need to run it reasonably often (at least monthly, and probably more often) to ensure that your implementation is in a sane state. You should even test that the generated translated variant is able to recompile itself in several stages. BTW, GCC bootstraps in at least 3 stages (requiring several hours of computer time), for good reasons.
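
The staged check described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not how GCC or any real build system does it: `Artifact`, `toy_run`, `staged_bootstrap` and `reached_fixed_point` are hypothetical names, and a compiler executable is modelled as a pair (which compiler logic generated it, which logic it implements). Stage 1 is built by the saved translated form, stage 2 by stage 1, and stage 3 by stage 2. If the compiler is deterministic, stages 2 and 3 should be identical.

```python
from typing import Callable, List, NamedTuple


class Artifact(NamedTuple):
    """Toy stand-in for a compiler executable (hypothetical model)."""
    built_by: str  # logic of the compiler that generated this artifact
    logic: str     # logic this artifact implements (taken from its source)


def toy_run(compiler: Artifact, source: str) -> Artifact:
    """Pretend to run `compiler` on `source`.

    Assumption: the output is deterministic and depends only on the
    running compiler's logic and on the source being compiled.
    """
    return Artifact(built_by=compiler.logic, logic=source)


def staged_bootstrap(
    seed: Artifact,
    source: str,
    run: Callable[[Artifact, str], Artifact] = toy_run,
    stages: int = 3,
) -> List[Artifact]:
    """Build `source` repeatedly, each stage using the previous stage's output."""
    artifacts = []
    compiler = seed
    for _ in range(stages):
        compiler = run(compiler, source)
        artifacts.append(compiler)
    return artifacts


def reached_fixed_point(artifacts: List[Artifact]) -> bool:
    """True if the last two stages produced identical artifacts."""
    return len(artifacts) >= 2 and artifacts[-1] == artifacts[-2]


if __name__ == "__main__":
    # The seed plays the role of the translated form kept in version control.
    seed = Artifact(built_by="c++-stepping-stone", logic="saved-translated-form")
    stages = staged_bootstrap(seed, source="current-compiler-source")
    for number, artifact in enumerate(stages, start=1):
        print(f"stage {number}: {artifact}")
    print("fixed point reached:", reached_fixed_point(stages))
```

In a real setup, `run` would presumably invoke the compiler executable and the artifacts would be compared byte for byte, which only works if the compiler output contains nothing non-deterministic such as timestamps. Stage 1 is expected to differ from stage 2, because it was generated by the older saved compiler; a difference between stage 2 and stage 3 is the signal that something is wrong.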