Tags: c++, io, lexer

Why is reading char by char faster than iterating over whole file string?


I have a lexer that consumes a file character by character, looking for tokens. I tried two methods for NextChar(): the first reads directly from the ifstream via ifstream::get(ch); the second loads the whole file into a std::stringstream up front to avoid disk I/O overhead.

get() method:

inline void Scanner::NextChar()
{
    inputStream.get(unscannedChar);
    currentCol++;

    // Skip over spaces.
    while (unscannedChar == ' ')
    {
        inputStream.get(unscannedChar);
        currentCol++;
    }

    if (inputStream.eof()) {
        unscannedChar = std::char_traits<char>::eof();
    }
}

stringstream method: loading the file into the stringstream takes almost no time, but indexing into it is extremely slow.

inline void Scanner::NextChar()
{
    unscannedChar = buffer.str()[counter++];
    currentCol++;

    // Skip over spaces.
    while (unscannedChar == ' ')
    {
        unscannedChar = buffer.str()[counter++];
        currentCol++;
    }

    if (counter > buffer.str().size())
    {
        unscannedChar = std::char_traits<char>::eof();
    }
}

I expected the second method to be much faster, since it iterates over characters in memory rather than on disk, but I was wrong. Here are some of my tests:

| tokens | ifstream::get() (sec) | stringstream::str()[] (sec) |
|--------|-----------------------|-----------------------------|
| 5      | 0.001                 | 0.001                       |
| 800    | 0.002                 | 0.295                       |
| 21000  | 0.044                 | 693.403                     |

NextChar() is performance-critical for my project and needs to be as fast as possible. I would appreciate an explanation of why I am getting these results.


Solution

  • std::ifstream is already doing its own internal buffering, so it's not like it has to go out and wait for the hard drive to respond every time you call get(ch); 99.99% of the time, it already has your next character available in its internal read-buffer and just has to do a one-byte copy to hand it over to your code.

    Given that, there's no additional speedup to be gained by copying the entire file into your own separate RAM buffer; indeed, doing that is likely to make things slower, since it means you can't start parsing the data until after the entire file has been read into RAM (whereas with ifstream's smaller read-ahead buffer, your code can start parsing chars as soon as the first part of the file has been loaded, and parsing can continue to some extent in parallel with disk reads after that).

    On top of that, stringstream::str() returns a string object by value every time you call it, which can be very expensive if the returned string is large. (i.e. you are making an in-RAM copy of the file's contents, and then throwing it away, for every character you parse! That makes each NextChar() call O(n) in the length of the file, so the whole parse degrades super-linearly, which is exactly the kind of blow-up your timing table shows.) A sketch of the fix follows below.
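
    For illustration, here is a minimal sketch of that change (the constructor and any member names beyond those in the question are assumptions, not the poster's actual class): read the stream's contents into a std::string exactly once, then index the cached string on every call instead of calling str() per character.

#include <fstream>
#include <sstream>
#include <string>

class Scanner
{
public:
    explicit Scanner(const std::string& path)
    {
        std::ifstream file(path, std::ios::binary);
        std::ostringstream ss;
        ss << file.rdbuf();  // read the file exactly once...
        contents = ss.str(); // ...and copy its contents out exactly once
    }

    void NextChar()
    {
        do
        {
            if (counter >= contents.size())
            {
                // Mirrors the question's EOF convention (note that storing
                // eof() in a plain char narrows the value).
                unscannedChar = std::char_traits<char>::eof();
                return;
            }
            // Index the cached string: no per-call copy of the file.
            unscannedChar = contents[counter++];
            currentCol++;
        } while (unscannedChar == ' '); // skip spaces, as in the original
    }

    std::size_t counter = 0;
    int currentCol = 0;
    char unscannedChar = '\0';

private:
    std::string contents;
};

    You could get the same effect while keeping the original stringstream by caching the result of a single str() call in a member; the essential point is that the copy happens once, not once per character.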