Most of the resources on lexical analyzers and parsers illustrate the use of streams to communicate between them (or so I understand).
It is explained that the parser asks for the next token, say by calling a function `getNextToken()`, and the lexer responds by returning the next token. Are we supposed to think of them as two objects interacting within the same program, or as two different programs interacting through streams?
Also, I haven't been able to understand why a serial approach isn't chosen, i.e. the lexical analyzer runs to the end of the provided source, and only then does the parser use its output for parsing. To be precise, if the lexical analyzer reads the next lexeme only when the parser asks for the next token, how is an error handled? Especially if the error occurs towards the end of the file, all the computation done by the parser up to that point might be wasted (assuming a very basic parser without any error-handling capabilities). Is recent output cached?
My answer below refers only to the Flex-Bison (or Lex-Yacc) model of compilation; I have little knowledge of other models.
I think of the lexer / parser combination as two co-operating modules in the same program.
When you use Flex with Bison, you see that the latter calls a function `yylex()` provided by the former (`yylex()` is the equivalent of the `getNextToken()` function in your question). So it makes more sense to think of them as co-operating units in a single program rather than as two different programs. Moreover, if the lexer and parser were two separate programs, you'd have to deal with inter-process communication, shared memory, and related issues, further complicating the task at hand.
To answer your second question:
I can think of one important issue that could arise from the parser coming into action only after the lexer has finished reading all the input: memory usage would be enormous for even moderately sized programs, because you would have to keep a data structure in memory for every token. Think of single-character tokens like `,` and `=` each occupying a multi-byte token record, and you'll quickly see why that doesn't scale.
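As a back-of-the-envelope illustration (the struct layout below is hypothetical, but any realistic token record needs at least a kind tag plus some payload):

```c
#include <stdio.h>

/* Hypothetical token record: even a one-character token like ','
   costs the full sizeof(Token), typically 16 bytes on a 64-bit
   machine once padding is included. */
typedef struct {
    int         type;     /* token kind, e.g. COMMA, ASSIGN */
    int         line;     /* where it came from             */
    const char *lexeme;   /* pointer to the matched text    */
} Token;

int main(void)
{
    printf("bytes per token: %zu\n", sizeof(Token));
    printf("1M tokens buffered: ~%zu MB\n",
           (size_t)1000000 * sizeof(Token) / (1024 * 1024));
    return 0;
}
```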
As for error handling: if the lexer cannot match the input against any regular expression, then `yylex()` should return -1 to the parser, using a Flex rule like the following:

```
.    { return -1; }
```

(Note the near-invisible period in the first column, which matches any input character except `\n`.)
(NOTE: This rule should be the last rule to appear in your Flex file, because rule order determines priority: Flex picks the longest possible match, and among rules matching the same length of input, the one that appears first in the file.)
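Putting that together, the rules section of such a Flex file might look like the sketch below (illustrative only; the single-character tokens return their own character codes, a common convention with Bison, and the catch-all rule sits last):

```
%%
"="        { return '='; }
","        { return ','; }
[ \t\n]    { /* skip whitespace */ }
.          { return -1; }   /* catch-all error rule: must stay last */
%%
```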
A return value of -1 from the lexer indicates a tokenization error, and the Bison parser handles it automatically by calling `yyerror(char *)` (ideally defined by you); likewise, if the input produces a parsing error, the parser again calls `yyerror(char *)`.
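For completeness, a minimal `yyerror()` definition might look like this sketch (the `const char *` signature matches recent Bison versions; `yylineno` is only maintained by Flex when `%option yylineno` is enabled):

```c
#include <stdio.h>

extern int yylineno;   /* provided by Flex with %option yylineno */

/* Bison invokes this callback whenever the token stream cannot
   be matched against the grammar. */
void yyerror(const char *msg)
{
    fprintf(stderr, "error near line %d: %s\n", yylineno, msg);
}
```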
Also, if you want to display the offending piece of code when an error is encountered, you need a way to get from the defective token back to the related source text. This means that the approach of reading the entire input first and parsing afterwards would not work at all, unless you stored the associated source text alongside each token while tokenizing, essentially making a memory behemoth of a compiler.
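To see why, consider what each buffered token would have to carry in the serial approach just to reproduce a decent error message later (a hypothetical layout):

```c
/* Hypothetical record for the read-everything-first approach:
   every token must drag its source context along, multiplying
   the per-token memory cost discussed above. */
typedef struct {
    int   type;          /* token kind                           */
    char *lexeme;        /* copy of the matched text             */
    int   line, col;     /* position in the original source      */
    char *source_line;   /* copy of the surrounding source line,
                            kept solely for error reporting      */
} BufferedToken;
```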