Tags: compiler-construction, lexer

Should a lexer return a list of tokens or a token on the fly?


I was implementing a lexer for a project and was confused about how it should hand tokens to the parser. Should the lexer eagerly build the full list of tokens as soon as it is constructed, or should the parser call a function that lazily returns the next token on demand? Also, how does a tool like lex/flex do it?


Solution

  • The scanner produced by (f)lex is designed to produce one token each time it is called, as does pretty much every other scanner generator in common use. Experience shows that this is the easiest interface to use. It also minimises unnecessary dynamic memory management.
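    As a rough illustration, a hand-written pull-style lexer can expose the same one-token-per-call interface that a (f)lex-generated `yylex()` does. Everything below (the token set, the global input pointer, the helper names) is invented for the sketch:

    ```c
    #include <assert.h>
    #include <ctype.h>

    /* Hypothetical on-demand lexer: the parser calls next_token()
     * each time it needs another token, mirroring yylex(). */
    typedef enum { TOK_NUMBER, TOK_PLUS, TOK_EOF } TokenType;

    typedef struct {
        TokenType type;
        long value;              /* meaningful only for TOK_NUMBER */
    } Token;

    static const char *input;    /* current read position */

    Token next_token(void) {
        while (isspace((unsigned char)*input))
            input++;                                   /* drop whitespace */
        if (*input == '\0')
            return (Token){TOK_EOF, 0};
        if (*input == '+') {
            input++;
            return (Token){TOK_PLUS, 0};
        }
        if (isdigit((unsigned char)*input)) {
            long v = 0;
            while (isdigit((unsigned char)*input))
                v = v * 10 + (*input++ - '0');
            return (Token){TOK_NUMBER, v};
        }
        input++;                 /* unrecognised character: skip it
                                    (a real lexer would report an error) */
        return next_token();
    }

    int main(void) {
        input = "12 + 34";
        Token t = next_token();
        assert(t.type == TOK_NUMBER && t.value == 12);
        assert(next_token().type == TOK_PLUS);
        t = next_token();
        assert(t.type == TOK_NUMBER && t.value == 34);
        assert(next_token().type == TOK_EOF);
        return 0;
    }
    ```

    Note that no token is ever heap-allocated or stored: each one lives only as long as the parser needs it, which is where the memory-management saving comes from.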

    However, the (f)lex-generated scanner is not limited to this model. A rule action does not have to return anything. It could push the token onto the end of a token list, or call an incremental token consumer (a "push" parser) or just drop the token on the floor (as is commonly done with whitespace tokens).
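    For instance, a flex rule set could use any of these action styles. The names `NUMBER`, `PLUS` and `push_token` are hypothetical here, and a real scanner would normally commit to one model rather than mixing them; the fragment only shows the three possibilities side by side:

    ```lex
    %%
    [0-9]+     { return NUMBER; }       /* pull model: hand one token back per call   */
    "+"        { push_token(PLUS); }    /* push model: feed an incremental consumer,
                                           or append to a token list — no return      */
    [ \t\n]+   { /* no action: drop whitespace on the floor */ }
    %%
    ```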

    So (f)lex doesn't limit the possibilities. But it offers no features to support the memory management needed to accumulate a token array. If for some reason you find such an array necessary, implementing it is your responsibility.
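    If you do decide to accumulate tokens, the bookkeeping is the usual growable-array pattern. This sketch, with an invented `Token` type, shows exactly the memory management that (f)lex leaves to you:

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Hypothetical token record; a real one would carry the lexeme,
     * source location, etc. */
    typedef struct { int type; long value; } Token;

    typedef struct {
        Token *data;
        size_t len, cap;
    } TokenList;

    /* Append a token, doubling the backing array when it fills up. */
    void token_list_push(TokenList *list, Token tok) {
        if (list->len == list->cap) {
            size_t new_cap = list->cap ? list->cap * 2 : 16;
            Token *p = realloc(list->data, new_cap * sizeof *p);
            if (!p)
                abort();         /* out of memory */
            list->data = p;
            list->cap = new_cap;
        }
        list->data[list->len++] = tok;
    }

    int main(void) {
        TokenList list = {0};
        for (int i = 0; i < 100; i++)
            token_list_push(&list, (Token){i, 10L * i});
        assert(list.len == 100);
        assert(list.data[42].type == 42 && list.data[42].value == 420);
        free(list.data);
        return 0;
    }
    ```

    In a flex scanner, the `token_list_push` call would go in each rule action instead of a `return` statement.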

    It's uncommon to see parsers which need a token array, in part because in many languages there are times when tokenisation depends on the parser's state (so-called "lexical feedback"). But it's not unknown. The GCC C++ compiler uses a token array, for example, both to facilitate preprocessing and to simplify its frequent use of lookahead and fallback.
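    A classic instance of lexical feedback is the C typedef problem: whether a name should be tokenised as an identifier or as a type name depends on declarations the parser has already processed, so the tokens cannot all be classified before parsing begins. A toy sketch, with all names hypothetical:

    ```c
    #include <assert.h>
    #include <string.h>

    /* Token kinds the lexer can assign to a scanned name. */
    enum { TOK_IDENTIFIER, TOK_TYPE_NAME };

    /* Parser-maintained table of names declared as typedefs. */
    #define MAX_TYPEDEFS 64
    static const char *typedef_names[MAX_TYPEDEFS];
    static int n_typedefs;

    /* Called by the parser when it reduces a typedef declaration. */
    void declare_typedef(const char *name) {
        if (n_typedefs < MAX_TYPEDEFS)
            typedef_names[n_typedefs++] = name;
    }

    /* Called by the lexer to classify a just-scanned name:
     * the answer depends on parser state, not on the input alone. */
    int classify_name(const char *name) {
        for (int i = 0; i < n_typedefs; i++)
            if (strcmp(typedef_names[i], name) == 0)
                return TOK_TYPE_NAME;
        return TOK_IDENTIFIER;
    }

    int main(void) {
        assert(classify_name("foo") == TOK_IDENTIFIER);
        declare_typedef("foo");      /* parser saw: typedef int foo; */
        assert(classify_name("foo") == TOK_TYPE_NAME);
        return 0;
    }
    ```

    The same input characters yield different tokens before and after the typedef is parsed, which is why a tokenise-everything-up-front design needs extra machinery in such languages.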

    But for an introductory parsing project, I'd definitely suggest using the common interface (on-demand lexical analysis). Indeed, I'd suggest starting with flex or equivalent, because it's more useful to learn how to use a lexical analyser before learning how to implement one.