Search code examples
c++ccompiler-constructionlow-level

Compiler-Programming: What are the most fundamental ingredients?


I am interested in writing a very minimalistic compiler.

I want to write a small piece of software (in C/C++) that fulfills the following criteria:

  • output in ELF format (*nix)
  • input is a single textfile
  • C-like grammar and syntax
  • no linker
  • no preprocessor
  • very small (max. 1-2 KLOC)

Language features:

  • native data types: char, int and floats
  • arrays (for all native data types)
  • variables
  • control structures (if-else)
  • functions
  • loops (would be nice)
  • simple algebra (div, add, sub, mul, boolean expressions, bit-shift, etc.)
  • inline asm (for system calls)

Can anybody tell me how to start? I don't know what parts a compiler consists of (at least not in the sense that I just could start right off the shelf) and how to program them. Thank you for your ideas.


Solution

  • Firstly, you need to decide whether you are going to make a compiler or an interpreter. A compiler translates your code into something that can be run either directly on hardware, in an interpreter, or get compiled into another language which then is interpreted in some way. Both types of languages are turing complete so they have the same expressive capabilities. I would suggest that you create a compiler which compiles your code into either .net or Java bytecode, as it gives you a very optimized interpreter to run on as well as a lot of standard libraries.

    Once you made your decision there are some common steps to follow

    1. Language definition Firstly, you have to define how your language should look syntactically.

    2. Lexer The second step is to create the keywords of your code, known as tokens. Here, we are talking about very basic elements such as numbers, addition sign, and strings.

    3. Parsing The next step is to create a grammar that matches your list of tokens. You can define your grammar using e.g. a context-free grammar. A number of tools can be fed with one of these grammars and create the parser for you. Usually, the parsed tokens are organized into a parse tree. A parse tree is the representation of your grammar as a data structure which you can move around in.

    4. Compiling or Interpreting The last step is to run some logic on your parse tree. A simple way to make your own interpreter is to create some logic associated to each node type in your tree and walk through the tree either bottom-up or top-down. If you want to compile to another language you can insert the logic of how to translate the code in the nodes instead.

    Wikipedia is great for learning more, you might want to start here.

    Concerning real-world reading material I would suggest "Programming language processors in JAVA" by David A Watt & Deryck F Brown. I used that book in my compilers course and learning by example is great in this field.