parsing compiler-construction tokenize lexical-analysis

What does the data structure produced by lexical analysis look like?

I know the lexical analyser tokenizes the input and stores it in a stream, or at least that is what I understood. Unfortunately nearly all articles I have read only talk about lexing simple expressions. What I am interested in is how to tokenize something like:

if (fooBar > 5) {
  for (var i = 0; i < alot.length; i++) {
    fooBar += 2 + i;
  }
}

Please note that this is pseudo code.

Question: What does the data structure for the tokens created by the lexer look like? I really have no idea how it would work for the example I gave above, where the code is nested. An example would be nice.

Solution

First of all, tokens are not necessarily stored. Some compilers do store the tokens in a table or other data structure, but for a simple compiler (if there is such a thing) it is usually sufficient for the lexer to return the type of the next token to be parsed. In some cases the parser then asks the lexer for the actual text that makes up the token.

If we use your sample code,

if (fooBar > 5) {
  for (var i = 0; i < alot.length; i++) {
    fooBar += 2 + i;
  }
}

The type of the first token in this sample might be defined as TOK_IF, corresponding to the "if" keyword. The next token might be TOK_LPAREN, then TOK_IDENT, then TOK_GREATER, then TOK_INT_LITERAL, and so on. Exactly what the types should be is up to you as the author of the lexer (or tokenizer) code. (Note that there are many tools, such as the lexer generators lex and flex, that help you avoid the somewhat tedious task of working out these details by hand.)

Except for TOK_IDENT and TOK_INT_LITERAL, the tokens we've seen so far are defined entirely by their type. For these two, we need to be able to ask the lexer for the underlying text so that we can determine the name of the identifier or the value of the literal.
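To make the "ask the lexer for the next type, then for the text" idea concrete, here is a minimal sketch of such a lexer in C++. The token set, the class name Lexer and the method GetTokenText are my own assumptions for illustration; only GetNextTokenType and the TOK_ names used earlier come from the discussion above. It handles just enough of the sample code to show the principle and is not meant as a complete tokenizer.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical token types; the exact set is up to the lexer's author.
enum TokenType {
    TOK_IF, TOK_FOR, TOK_VAR, TOK_IDENT, TOK_INT_LITERAL,
    TOK_LPAREN, TOK_RPAREN, TOK_LBRACE, TOK_RBRACE,
    TOK_GREATER, TOK_LESS, TOK_ASSIGN, TOK_PLUS, TOK_PLUS_ASSIGN,
    TOK_INCREMENT, TOK_DOT, TOK_SEMICOLON, TOK_EOF, TOK_UNKNOWN
};

class Lexer {
public:
    explicit Lexer(std::string source) : src_(std::move(source)) {}

    // Scans the next token and returns only its type.
    TokenType GetNextTokenType() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_])))
            ++pos_;
        start_ = pos_;
        if (pos_ >= src_.size()) return TOK_EOF;

        unsigned char c = static_cast<unsigned char>(src_[pos_]);
        if (std::isalpha(c) || c == '_') {
            // Identifier or keyword: letters, digits and underscores.
            while (pos_ < src_.size() &&
                   (std::isalnum(static_cast<unsigned char>(src_[pos_])) || src_[pos_] == '_'))
                ++pos_;
            std::string word = GetTokenText();
            if (word == "if") return TOK_IF;
            if (word == "for") return TOK_FOR;
            if (word == "var") return TOK_VAR;
            return TOK_IDENT;
        }
        if (std::isdigit(c)) {
            while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_])))
                ++pos_;
            return TOK_INT_LITERAL;
        }
        ++pos_;  // consume the first character of an operator or punctuation mark
        switch (c) {
            case '(': return TOK_LPAREN;
            case ')': return TOK_RPAREN;
            case '{': return TOK_LBRACE;
            case '}': return TOK_RBRACE;
            case '>': return TOK_GREATER;
            case '<': return TOK_LESS;
            case '=': return TOK_ASSIGN;
            case '.': return TOK_DOT;
            case ';': return TOK_SEMICOLON;
            case '+':
                if (Match('+')) return TOK_INCREMENT;
                if (Match('=')) return TOK_PLUS_ASSIGN;
                return TOK_PLUS;
            default: return TOK_UNKNOWN;
        }
    }

    // Text of the most recently scanned token, e.g. "fooBar" or "5".
    std::string GetTokenText() const { return src_.substr(start_, pos_ - start_); }

private:
    // Consumes the next character only if it equals `expected`.
    bool Match(char expected) {
        if (pos_ < src_.size() && src_[pos_] == expected) { ++pos_; return true; }
        return false;
    }

    std::string src_;
    std::size_t pos_ = 0;    // position of the next unread character
    std::size_t start_ = 0;  // start of the most recently scanned token
};
```

With this sketch, constructing Lexer lexer("if (fooBar > 5) {") and calling GetNextTokenType() repeatedly yields TOK_IF, TOK_LPAREN, TOK_IDENT and so on; calling GetTokenText() right after the TOK_IDENT gives "fooBar". Note that nothing is stored: the lexer only remembers where the current token starts and ends.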

So a tiny excerpt of the parser dealing with an IF statement in pseudo-code might look something like:

...
  switch (lexer.GetNextTokenType()) {
  case TOK_IF:
    {
      // "if" statement: expect '(' expression ')'
      if (lexer.GetNextTokenType() != TOK_LPAREN)
        throw SyntaxError("( expected");
      ParseRelationalExpression(lexer);
      if (lexer.GetNextTokenType() != TOK_RPAREN)
        throw SyntaxError(") expected");
      ...

and so on.
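For anyone who wants to compile and play with that excerpt, here is a small self-contained sketch of the same idea in C++. Everything beyond the names in the excerpt is an assumption made for illustration: StubLexer simply replays a fixed list of token types instead of scanning text, ParseIfHeader and ParseOperand are hypothetical helpers, and ParseRelationalExpression is reduced to "operand, relational operator, operand".

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum TokenType {
    TOK_IF, TOK_LPAREN, TOK_RPAREN, TOK_IDENT, TOK_INT_LITERAL,
    TOK_GREATER, TOK_LESS, TOK_EOF
};

struct SyntaxError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Stand-in lexer (an assumption for this sketch): it replays a fixed
// sequence of token types instead of scanning real source text.
class StubLexer {
public:
    explicit StubLexer(std::vector<TokenType> types) : types_(std::move(types)) {}

    // Returns the next type, or TOK_EOF once the sequence is exhausted.
    TokenType GetNextTokenType() {
        return next_ < types_.size() ? types_[next_++] : TOK_EOF;
    }

private:
    std::vector<TokenType> types_;
    std::size_t next_ = 0;
};

// An operand is an identifier or an integer literal in this sketch.
void ParseOperand(StubLexer& lexer) {
    TokenType t = lexer.GetNextTokenType();
    if (t != TOK_IDENT && t != TOK_INT_LITERAL)
        throw SyntaxError("operand expected");
}

// Simplified grammar: operand, relational operator, operand.
void ParseRelationalExpression(StubLexer& lexer) {
    ParseOperand(lexer);
    TokenType op = lexer.GetNextTokenType();
    if (op != TOK_GREATER && op != TOK_LESS)
        throw SyntaxError("relational operator expected");
    ParseOperand(lexer);
}

// Parses the header of an "if" statement: if ( expression )
void ParseIfHeader(StubLexer& lexer) {
    if (lexer.GetNextTokenType() != TOK_IF)
        throw SyntaxError("if expected");
    if (lexer.GetNextTokenType() != TOK_LPAREN)
        throw SyntaxError("( expected");
    ParseRelationalExpression(lexer);
    if (lexer.GetNextTokenType() != TOK_RPAREN)
        throw SyntaxError(") expected");
}
```

Feeding it the types for "if (fooBar > 5)", i.e. TOK_IF, TOK_LPAREN, TOK_IDENT, TOK_GREATER, TOK_INT_LITERAL, TOK_RPAREN, returns normally, while a sequence that lacks the opening parenthesis throws SyntaxError with the message "( expected". The parser never sees more than one token at a time, which is why the lexer does not have to store anything.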

If the compiler does choose to store the tokens for later reference (some compilers do, e.g. to allow more efficient backtracking), one way would be to use a structure similar to the following:

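struct Token {
  int TokenType;           // e.g. TOK_IF, TOK_IDENT
  const char* TokenStart;  // first character of the token in the source buffer
  int TokenLength;         // number of characters in the token
};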

The container for these might be a linked list or std::vector (assuming C++).
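As a rough sketch of the stored variant, the following C++ fills a std::vector with such token records. The function names Tokenize and TokenText and the coarse TokenKind classification are assumptions of mine to keep the example short: keywords are lumped in with identifiers, and every punctuation character other than a brace becomes a single-character TOK_OTHER token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Deliberately coarse classification, just for this sketch.
enum TokenKind { TOK_IDENT, TOK_INT_LITERAL, TOK_LBRACE, TOK_RBRACE, TOK_OTHER };

struct Token {
    int TokenType;           // one of the TokenKind values
    const char* TokenStart;  // first character of the token in the source buffer
    int TokenLength;         // number of characters in the token
};

// Tokenizes a NUL-terminated buffer. The tokens point into `source`,
// so the buffer must outlive the returned vector.
std::vector<Token> Tokenize(const char* source) {
    std::vector<Token> tokens;
    const char* p = source;
    while (*p != '\0') {
        unsigned char c = static_cast<unsigned char>(*p);
        if (std::isspace(c)) { ++p; continue; }  // whitespace separates tokens

        const char* start = p;
        int type = TOK_OTHER;
        if (std::isalpha(c) || c == '_') {
            type = TOK_IDENT;  // keywords are not distinguished here
            while (std::isalnum(static_cast<unsigned char>(*p)) || *p == '_') ++p;
        } else if (std::isdigit(c)) {
            type = TOK_INT_LITERAL;
            while (std::isdigit(static_cast<unsigned char>(*p))) ++p;
        } else {
            if (c == '{') type = TOK_LBRACE;
            else if (c == '}') type = TOK_RBRACE;
            ++p;  // every other punctuation character is its own token
        }
        tokens.push_back(Token{type, start, static_cast<int>(p - start)});
    }
    return tokens;
}

// Copies a token's characters out of the source buffer.
std::string TokenText(const Token& t) {
    return std::string(t.TokenStart, static_cast<std::size_t>(t.TokenLength));
}
```

Running Tokenize on the nested sample from the question produces one flat sequence: the inner "{" and "}" are simply two more entries in the vector. In this sketch the lexer knows nothing about nesting; recognising that a "for" loop sits inside an "if" block is left to the parser that consumes the tokens.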