v8 compiler-theory

Is this an intermediate representation?


I'm looking into how the V8 compiler works. I read an article which states that the source code is tokenized and parsed, an AST is constructed, and then bytecode is generated (https://medium.com/dailyjs/understanding-v8s-bytecode-317d46c94775).

Is this bytecode an intermediate representation?


Solution

  • Short answer: No. Usually people use the terms "bytecode" and "intermediate representation" to mean two different things.

    Long answer: It depends a bit on your definition (but for most definitions, "no" is still the right answer).

    "Bytecode" in virtual machines like V8 refers to a representation that is used as input for an interpreter. The article you linked to gives a good description.

    "Intermediate representation" or IR usually refers to data that a compiler uses internally, as an intermediate step (hence the name) between its input (usually the AST = abstract syntax tree, i.e. the parsed version of the source text) and its output (usually machine code or bytecode, but it could be anything, as in a source-to-source compiler).

    So in a traditional setup, you have:

    source --(parser)--> AST --(compiler front-end)--> IR --(compiler back-end)--> machine code

    where the IR is usually modified several times as the compiler performs various optimizations on it, before finally generating machine code from it. There can also be several different IRs. For example, V8's earlier optimizing compiler ("Crankshaft") had two: the high-level IR "Hydrogen" and the low-level IR "Lithium". V8's current optimizing compiler ("TurboFan") even has three: "JavaScript-level nodes", "Simplified nodes", and "Machine-level nodes".
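
    To make the traditional setup concrete, here is a minimal sketch of the three stages for a toy language that only has integers, variables, and addition. Everything in it is an assumption made for illustration: the node classes, the three-address IR tuples, and the pseudo-assembly are hypothetical and are not how V8 (or any real compiler) represents things. The point is only that the IR is a separate data structure that the compiler rewrites (here with constant folding) before the back-end emits code.

```python
from dataclasses import dataclass

# --- AST: what a parser would hand to the compiler (hypothetical node types) ---
@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Add:
    left: object
    right: object

def lower_to_ir(ast):
    """Front-end: flatten the AST into three-address IR tuples (dest, op, a, b)."""
    instrs = []

    def visit(node):
        if isinstance(node, Num):
            return node.value
        if isinstance(node, Var):
            return node.name
        lhs = visit(node.left)
        rhs = visit(node.right)
        tmp = f"t{len(instrs)}"  # fresh temporary name
        instrs.append((tmp, "add", lhs, rhs))
        return tmp

    result = visit(ast)
    return instrs, result

def fold_constants(instrs, result):
    """Optimization pass on the IR: evaluate additions of two known constants."""
    known = {}  # temporary name -> constant value
    out = []
    for dest, op, a, b in instrs:
        a = known.get(a, a)
        b = known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            known[dest] = a + b  # instruction disappears, value is remembered
        else:
            out.append((dest, op, a, b))
    return out, known.get(result, result)

def emit_machine_code(instrs, result):
    """Back-end: turn IR into pseudo-assembly text (a stand-in for machine code)."""
    lines = []
    for dest, op, a, b in instrs:
        lines.append(f"mov {dest}, {a}")
        lines.append(f"{op} {dest}, {b}")
    lines.append(f"ret {result}")
    return lines

# x + (2 + 3)
ast = Add(Var("x"), Add(Num(2), Num(3)))
ir, res = lower_to_ir(ast)
opt_ir, opt_res = fold_constants(ir, res)
asm = emit_machine_code(opt_ir, opt_res)
print(asm)
```

    The unoptimized IR has two `add` instructions; after folding, only one remains, which is the kind of rewriting the paragraph above refers to.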

    Now if you wanted to draw the boxes in your whiteboard diagram of the system a little differently, then instead of having a "parser" and a "compiler" you could treat everything between source and machine code as one big "compiler" (which as a first step parses the source). In that case, the AST would be a form of intermediate representation. But, as stated above, usually when people use the term IR they mean "compiler IR", not the AST.

    In a virtual machine like V8, the overall execution pipeline is more complicated than described above. It starts with:

    source --(parser)--> AST --(bytecode generator)--> bytecode

    This bytecode is primarily used as input for V8's interpreter. As an optimization, when V8 decides to run a function through the optimizing compiler, it does not start with the source code and a parser again, but instead the optimizing compiler uses the bytecode as its input. In diagram form:

    bytecode --(interpreter)--> program execution

    bytecode --(compiler front-end)--> IR --(compiler back-end)--> machine code --(CPU)--> program execution
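
    The two diagrams above can be sketched in a few lines. This is a toy under stated assumptions: the tuple-based AST, the stack-based opcodes (`PUSH`, `LOAD`, `ADD`, `RETURN`), and the IR tuples are all invented for this example and are not V8's actual bytecode format or IR. What it illustrates is that the same bytecode list can either be executed directly by an interpreter or be read by a compiler front-end that builds an IR from it, without going back to the source text.

```python
def generate_bytecode(ast):
    """Bytecode generator: walk a tuple-based AST and emit a flat instruction list."""
    code = []

    def visit(node):
        kind = node[0]
        if kind == "num":
            code.append(("PUSH", node[1]))
        elif kind == "var":
            code.append(("LOAD", node[1]))
        elif kind == "add":
            visit(node[1])
            visit(node[2])
            code.append(("ADD", None))
        else:
            raise ValueError(f"unknown AST node: {kind}")

    visit(ast)
    code.append(("RETURN", None))
    return code

def interpret(code, env):
    """Consumer 1, the interpreter: execute the bytecode on a value stack."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(env[arg])
        elif op == "ADD":
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)
        elif op == "RETURN":
            return stack.pop()
    raise RuntimeError("bytecode ended without RETURN")

def build_ir(code):
    """Consumer 2, a compiler front-end: walk the same bytecode symbolically
    and build three-address IR tuples (dest, op, a, b) instead of executing it."""
    instrs = []
    stack = []  # holds operand names and constants, not runtime values
    for op, arg in code:
        if op in ("PUSH", "LOAD"):
            stack.append(arg)
        elif op == "ADD":
            rhs = stack.pop()
            lhs = stack.pop()
            tmp = f"t{len(instrs)}"
            instrs.append((tmp, "add", lhs, rhs))
            stack.append(tmp)
        elif op == "RETURN":
            instrs.append((None, "ret", stack.pop(), None))
            return instrs
    raise RuntimeError("bytecode ended without RETURN")

# x + 2
ast = ("add", ("var", "x"), ("num", 2))
bytecode = generate_bytecode(ast)
value = interpret(bytecode, {"x": 40})
ir = build_ir(bytecode)
print(bytecode, value, ir)
```

    Note that `build_ir` never looks at the AST or the source: the bytecode is its only input, which is the sense in which bytecode sits "on the way" to machine code.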

    Now here's the part where your perspective comes in. The bytecode in V8 is used not only as input for the interpreter but also as input for the optimizing compiler, and in that sense it is a step on the way from source text to machine code. So if you wanted to call it a special form of intermediate representation, you wouldn't technically be wrong. It would be an unusual definition of the term, though: when a compiler theory textbook talks about "intermediate representation", it does not mean "bytecode".