LLVM - code generation flow

When I went through the LLVM document,

There are meanings in some terms that I dont fully understand.

Please provide feedbacks if you know any.

[Frontend] Source code --> Tokeniser (Token stream) --> Parser ( Parser Action )

Can someone explain what does Tokeniser exactly do ? and what does Parser do too ? Possibly provide an example which will make this more clear.
[Backend] IR --> Optimiser (IR) --> Code Generation

I dont understand at this step, what optimization should be done.

I know there some difference between various frontends and backends. But what I am asking is for general case.

Thanks for any feedback from you.

Solution

All of it is standard compiler terminology (see Aho, Lam, Sethi, Ullmann: Compilers).

The input to a compiler is a file which contains a string of characters.

IN:

int main(int argc, char* argv[])
{

  printf("blimey\n");
  // this is a comment
  return 0;
}

The output of the tokenizer is a sequence of tokens, where a token is defined as string of characters not containing whitespace (you can also design a better token type which would keep track if numbers look like reals or integers, sequences of charactes are reserved words or not...). Sometimes reserved words of the language are replaced by a unique identifier (usually an integer), sequence of characters which represent numbers are converted to actual numbers, so you might get:

OUT:

"int" "main" "(" "int" "argc" "," "char", "*", "argv" "[" "]" "{"
"printf" "(" "blimey\n" ")" ";" "return" "0" ";" "}"

Notice that the empty lines after "{" are gone so is the comment. You don't need the comment to build a parse tree. I also cheated with the string "blimey\n", it would need to be annotated as a constant (quoted) string. That was the tokenizer. The point of separating tokenizing/parsing is that tokenizing can be done with a finite state automaton which is faster than a parser.

The parser builds a tree out of the sequence above. Showing the output of the parser is difficult here because we do not have a grammar for the language being parsed. So I take a simpler language:

Tokenizer output for 'foo + 3 * bar':

"foo" "+" "3" "*" "bar"

There are a number of grammars for the language of arithmetic expressions in most of them the parser would build this tree:

      +
    /   \
"foo"    *
        / \
       3  "bar"

AST: "Abstract syntax trees differ... from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees" (Compilers, Sec 2.5)

Suppose you wrote the expression "foo+(3*bar)". The parser would still build the tree above, because the parenthesis are not necessary. But if you wrote "(foo+3)*bar", you would get a different tree:

        *
      /  \
     +   "bar"
   /  \
"foo"  3

No parenthesis! The abstract tree structure encodes everything. Implementation-wise: if you are writing your compiler in a modern object-oriented language, the abstract syntax tree would be a represented by a class hierarchy. In C, you'd have a 'struct' for each node type tagged (with an integer).

It is possible to generate executable code (or whatever code you need) from the tree but it is tedious. So in many cases (p code for Pascal, other three address code intermediate representations for C...) the tree is transformed (flattened) to an intermediate representation (IR). The purpose is that it is easier to do optimisations on a well designed IR. There are millions of optimizations you may want to do:

use algebraic identities x+0 => x, x*1 => x etc
drop unused variables
simplify control flow
reorder assignments
...

The optimiser most of the times preserves the IR...ie is a function IR -> IR, but there are a couple of clever optimizers which translate IR1 -> IR2 (change the representation), to make certain properties explicit (a search with 'benton' 'moggi' 'monad' will get you started).