
Understanding Double Buffering in Lexical Analyzer


I'm reading the "Purple Dragon Book" on Compilers as part of my Compiler Construction course at university. I'm having trouble understanding some things about double buffering while scanning input as part of Lexical Analysis.

Here's the text from the book:

Two pointers to the input are maintained:

  1. Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent we are attempting to determine.
  2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made will be covered in the balance of this chapter.

So, correct me if I'm wrong: one buffer is read, and when its input is exhausted, the other buffer is filled with new data from the source file and the buffers are swapped. The forward and lexemeBegin pointers then point into the freshly filled buffer.

My question is: what if part of a lexeme is at the end of the current buffer? Then, when the buffers switch, half of the lexeme will be at the end of one buffer and half at the beginning of the new buffer. The pointers will move to the new buffer, so won't we lose track of the half that was left in the other buffer?

Sorry if the question is vague, but I've been agonizing for quite some time over how this scenario is handled. I think the same problem would occur with a single buffer.


Solution

  • It's quite possible for the current lexeme to be split between the two buffers. No attempt is made to avoid this possibility.

    It's also possible, with certain collections of lexical patterns, for the current lexeme to be entirely contained in the older buffer, even though the data in the newer buffer is also valid. The Dragon book authors seem to be heroically avoiding this possibility, although I don't suppose it is that important (and even less important thirty years after the text was written).

    In the end, it doesn't really matter where the token is, because the input buffers as described in the Dragon book were only used as input buffers. The scanner would have had a separate token buffer, traditionally a statically-allocated char array, and that's the source of the token which is handed to the action procedure.

    Keeping a separate copy of the token seems a bit inefficient, but it's not too bad, particularly assuming you're willing to restrict the maximum size of a token. The token buffer solves a number of issues, particularly for lexical scanners written in C. To start with, it solves the problem of tokens split between input buffers. Also, it solves the problem of NUL-terminating the scanned token, which is required if the token is going to be used by any C standard library string function. It's not easy to NUL-terminate the token in the input buffer, because the character following the token might well be an essential part of the next token. (In some cases, it will be ignorable whitespace, but even then not all whitespace is ignorable; newline characters must be counted, for example, in order to maintain the line number for error messages and debugging information.)

    (There are ways to produce a NUL-terminated string, which I might get to later, but it definitely complicates the logic of the scanner. Flex NUL-terminates the token in place, for example, but that means that it has to overwrite and then restore a byte in the input buffer, with the consequence that you cannot lexically scan a string literal.)

    Low-cost copying of the scanned token into a token buffer can be done by keeping a pointer to the next byte in the token buffer and unconditionally copying each character into the token buffer as it is processed. Once the end of the token is identified, the token pointer is backed up if necessary and a NUL terminator is stored there, overwriting any undesired character which might have been unnecessarily placed into the token buffer.

    The point of the twinned buffers is to make it possible to fall back more than one byte at the end of a token. That's not something which happens often, but in many languages it can happen. For example, C and C++ allow the . token and the ... token, but not a .. token. The scanner needs to be able to handle that case. If the input is, for example, a..4, the scanner needs to return three tokens (a, . and .4). But a... would be an identifier a followed by .... (That's syntactically possible in C++. In C it's likely to be a syntax error.)

    So you need to consider the possibility that . falls at the end of an input buffer, while the next input buffer starts with .4. The scanner must read forward to the 4 in order to decide which token to return, but once it does that it will return the token ., which is entirely contained in the first buffer. When the scan resumes, it is vital that the scanner not attempt to refill the second buffer, since that would lose a buffer's worth of data.

    Despite that corner case, it is still possible to use the sentinel technique described in the Dragon book to eliminate almost all buffer-end checks, which significantly speeds up the scanner. One of the checks which can be eliminated is the check for token buffer overflow. It's only necessary to make the token buffer as big as the pair of input buffers combined; then, the token length limit only needs to be tested when the sentinel is detected, which happens very rarely. (Perhaps once every four or eight thousand characters.)

    Most lexical scanners written at the time did severely restrict token lengths, and some still do. (Flex no longer automatically imposes a token length limit but there are still configuration options which have the effect of limiting token length.)

    Certainly, that's not the only way to write a scanner. The Flex scanner generator generates scanners with no artificial token length limits and without restricting the character set to 255 valid characters. It's true that it is not really efficient at handling long tokens and tokens containing a NUL, but those cases are very rare so the inefficiencies are mostly negligible.

    Flex uses only a single buffer, and it also passes a pointer into this buffer as the value of yytext. It handles end of buffer, which it detects with a sentinel, by relocating the text scanned since lexemeBegin to the start of the buffer, and then filling the buffer from the relocated value of forward to the end of the buffer. If end of buffer is reached and no relocation is possible because the token text already starts at the beginning of the buffer, then it calls realloc or equivalent to expand the input buffer, and then reads a few kilobytes into the new input buffer, starting at the relocated endpoint of the previous buffer. In order to be able to use a sentinel even though NUL bytes can occur in the input, it compares the address of the NUL with the address of the end of the input buffer; if they are not the same, then the NUL was part of the input and Flex looks up the action for NUL in the state transition table. (It's a bit more complicated than that, but that's the high-level overview.) Since the buffer end is reached only occasionally, the extra cost of checking for NUL and so on is not paid very often.

    The advantage of the Flex architecture is that it reads a bit faster, since it's not working a character at a time, and it can handle big tokens, albeit inefficiently. Lex couldn't handle big tokens because the token buffer (yytext) was a static array, not dynamically allocated, and therefore couldn't be grown.