Tags: c++, pipeline, cout, cin

std::cin really slow


So I was trying to write myself a command for a Linux pipeline. Think of it as a replica of GNU 'cat' or 'sed': it takes input from stdin, does some processing, and writes to stdout.

I originally wrote an AWK script, but I wanted more performance, so I switched to the following C++ code:

std::string crtLine;
crtLine.reserve(1000);
while (true)
{
    std::getline(std::cin, crtLine);
    if (!std::cin) // failbit (EOF immediately found) or badbit (I/O error)
        break;

    std::cout << crtLine << "\n";
}

This is exactly what cat (without any parameters) does. As it turns out, this program is about as slow as its AWK counterpart, and nowhere near as fast as cat.

Testing on a 1GB file:

$time cat 'file' | cat | wc -l
real    0m0.771s

$time cat 'file' | filter-range.sh | wc -l
real    0m44.267s

Instead of getline(istream, string) I tried cin.getline(buffer, size), but there was no improvement. This is embarrassing; is it a buffering issue? I also tried fetching 100KB at a time instead of just one line, with no luck! Any ideas?

EDIT: What you folks say makes sense, BUT the culprit is not string building/copying, and neither is scanning for newlines (nor is it the size of the buffer). Take a look at these two programs:

char buf[200];
while (fgets(buf, 200, stdin))
    std::cout << buf;

$time cat 'file' | ./FilterRange > /dev/null
real    0m3.276s




char buf[200];
while (std::cin.getline(buf, 200))
    std::cout << buf << "\n";

$time cat 'file' | ./FilterRange > /dev/null
real    0m55.031s

Neither of them manipulates strings, and both of them scan for newlines, yet one is 17 times slower than the other. They differ only in the use of cin. I think we can safely conclude that cin screws up the timing.


Solution

  • This is exactly what cat (without any parameters) does.

    Not really. This has exactly the same effect as /bin/cat, but it does not use the same method.

    /bin/cat looks more like this:

    while( (readSize = read(inFd, buffer, sizeof buffer)) > 0)
      write(outFd, buffer, readSize);
    

    Notice that /bin/cat does no processing on its input. It doesn't build a std::string out of it, it doesn't scan it for \n, it just does one system call after another.

    Your program, on the other hand, builds strings, makes copies of them, scans them for \n, etc, etc.

    This small, complete program runs 2-3 orders of magnitude slower than /bin/cat:

    #include <string>
    #include <iostream>
    
    int main (int ac, char **av) {
      std::string crtLine;
      crtLine.reserve(1000);
      while(std::getline(std::cin, crtLine)) {
        std::cout << crtLine << "\n";
      }
    }
    

    I timed it thus:

    $ time ./x < inputFile > /dev/null
    $ time /bin/cat < inputFile > /dev/null
    


    EDIT: This program gets within 50% of the performance of /bin/cat:

    #include <string>
    #include <iostream>
    #include <vector>
    
    int main (int ac, char **av) {
      std::vector<char> v(4096);
      do {
        std::cin.read(&v[0], v.size());
        std::cout.write(&v[0], std::cin.gcount());
      } while(std::cin);
    }
    

    In short, if your requirement is to perform line-by-line analysis of the input, then you will have to pay some price for formatted input. If, on the other hand, you only need byte-by-byte (or block-at-a-time) processing, then you can use unformatted input and go much faster.