Search code examples
compiler-construction

File buffering in lexers: is it advisable, now that the OS (and language libraries) already implement buffers internally?


A technique of lexer optimization was mentioned in two books: Engineering a compiler by K.Cooper et al. and Crafting a Compiler by C.Fischer et al. Here is the except from the first book (page 69):

While character-by-character I/O leads to clean algorithmic formulations, the overhead of a procedure call per character is significant relative to the cost of simulating the DFA in either a table-driven or a direct-coded scanner. To reduce the I/O cost per character, the compiler writer can use buffered I/O, where each read operation returns a longer string of characters, or buffer, and the scanner then indexes through the buffer. The scanner maintains a pointer into the buffer...

My question is, what is the significance of this technique? Now that memory buffering is often already implemented by the Operating System, why did the authors suggest we implement a buffer? (Also, standard libraries provided by high-level languages usually have a buffer maintained with the filestream handling routines, like C++'s std::ifstream).

I know in some applications, like database systems, a custom buffer mechanism is not only desirable (knows more about access patterns), but sometimes necessary (recovery, logging). Do similar reasongs apply to lexers in compilers? If so, how?

EDIT: Here is a similar question: link, but I'd like to know more about the arguments for a custom buffer (like the argument supporting a buffer in database systems), if there is any.

Another post here compared manual buffering with std::fstream buffering in C++.


Solution

  • As others have pointed out, whether the OS buffers or not (it does) it is very costly for your application to rely on it as those OS/File System buffers are not in your applications address space. Why? Because for your app to get at that data it typically needs to travel through layers of calls to get to the OS buffers. If you are doing this for 1 character/byte at a time this will incur overhead.

    If you are using an IO library: Some of them do or will read 'ahead' for performance reasons and keep the OS calls to a minimum.

    If you, on the other hand, are operating without leveraging a library then it is strongly advised for you to setup buffered IO capability for the same reason other libraries will.

    Finally, the end result of your compilation is an executable thing. Unless you do not allow IO to occur you will want to have your language specific (assume self hosted) run-time to provide buffered IO for the same reasons. If your run-time is based on a language or series of libraries that provide it, you should be good.