
Reading lines from input


I'm looking to read from std::cin input formatted as below (it is always int, int, int, char[]/str). What would be the fastest way to parse each line into an int array[3] and either a string or char array?

#NumberOfLines(i.e.10000000)
1,2,2,'abc'
2,2,2,'abcd'
1,2,3,'ab'
...1M+ to 10M+ more lines, always in the form of (int,int,int,str)

At the moment, I'm doing something along these lines:

//unsync stdio
std::ios_base::sync_with_stdio(false);
std::cin.tie(nullptr);
//read from cin, line by line
std::string str;
while(std::getline(std::cin, str)){
    int array[3];
    for(int i = 0; i < 3; ++i){
        int commaindex = str.find(',');
        std::string substring = str.substr(0, commaindex);
        array[i] = atoi(substring.c_str());
        str.erase(0, commaindex + 1);
    }
    std::string label = str; //what remains after the third comma
    //assign array and label to other stuff and do other stuff, repeat
}

I'm quite new to C++ and recently learned profiling with Visual Studio, but I'm not the best at interpreting the results. IO takes up 68.2% of CPU usage and the kernel 15.8%; getline() covers 35.66% of the elapsed inclusive time.

Is there any way I can read large chunks at once to avoid calling getline() so often? I've been told fgets() is much faster; however, I'm unsure how to use it when I cannot predict the number of characters to specify.

I've attempted to use scanf as follows, but it was slower than the getline method. I have also used stringstreams, but that was incredibly slow.

scanf("%i,%i,%i,%s",&array[0],&array[1],&array[2],str);

Also, if it matters, it is run on a server with little memory available, so I think reading the entire input into a buffer would not be viable. Thanks!

Update: Using @ted-lyngmo's approach, I gathered the results below.

time wc datafile

real    4m53.506s
user    4m14.219s
sys     0m36.781s

time ./a.out < datafile

real    2m50.657s
user    1m55.469s
sys     0m54.422s

time ./a.out datafile

real    2m40.367s
user    1m53.523s
sys     0m53.234s

Solution

  • You could use std::from_chars (and reserve() approximately the number of lines in the file, if you store the values in a vector, for example). I also suggest adding support for reading directly from the file. Reading from a file opened by the program is (at least for me) faster than reading from std::cin (even with sync_with_stdio(false)).

    Example:

    #include <algorithm> // std::for_each
    #include <cctype>    // std::isspace
    #include <charconv>  // std::from_chars
    #include <cstdio>    // std::perror
    #include <fstream>
    #include <iostream>
    #include <iterator>  // std::istream_iterator
    #include <limits>    // std::numeric_limits
    
    struct foo {
        int a[3];
        std::string s;
    };
    
    std::istream& operator>>(std::istream& is, foo& f) {
        if(std::getline(is, f.s)) {
            std::from_chars_result fcr{f.s.data(), {}};
            const char* end = f.s.data() + f.s.size();
    
            // extract the numbers
            for(unsigned i = 0; i < 3 && fcr.ptr < end; ++i) {
                fcr = std::from_chars(fcr.ptr, end, f.a[i]);
                if(fcr.ec != std::errc{}) {
                    is.setstate(std::ios::failbit);
                    return is;
                }
                // find next non-whitespace
                do ++fcr.ptr;
                while(fcr.ptr < end &&
                      std::isspace(static_cast<unsigned char>(*fcr.ptr)));
            }
    
            // extract the string
            if(++fcr.ptr < end)
                f.s = std::string(fcr.ptr, end - 1);
            else
                is.setstate(std::ios::failbit);
        }
        return is;
    }
    
    std::ostream& operator<<(std::ostream& os, const foo& f) {
        for(int i = 0; i < 3; ++i) {
            os << f.a[i] << ',';
        }
        return os << '\'' << f.s << "'\n";
    }
    
    int main(int argc, char* argv[]) {
        std::ifstream ifs;
        if(argc >= 2) {
            ifs.open(argv[1]); // if a filename is given as argument
            if(!ifs) {
                std::perror(argv[1]);
                return 1;
            }
        } else {
            std::ios_base::sync_with_stdio(false);
            std::cin.tie(nullptr);
        }
    
        std::istream& is = argc >= 2 ? ifs : std::cin;
    
        // ignore the first line - it's of no use in this demo
        is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    
        // read all `foo`s from the stream
        std::uintmax_t co = 0;
        std::for_each(std::istream_iterator<foo>(is), std::istream_iterator<foo>(),
                      [&co](const foo& f) {
                          // Process each foo here
                          // Just counting them for demo purposes:
                          ++co;
                      });
        std::cout << co << '\n';
    }
    

    My test runs on a file with 1'000'000'000 lines with content looking like below:

    2,2,2,'abcd'
    2, 2,2,'abcd'
    2, 2, 2,'abcd'
    2, 2, 2, 'abcd'
    

    Unix time wc datafile

    1000000000  2500000000 14500000000 datafile
    
    real    1m53.440s
    user    1m48.001s
    sys     0m3.215s
    

    time ./my_from_chars_prog datafile

    1000000000
    
    real    1m43.471s
    user    1m28.247s
    sys     0m5.622s
    

    From this comparison I think one can see that my_from_chars_prog is able to successfully parse all entries pretty fast. It was consistently faster at doing so than wc, a standard Unix tool whose only purpose is to count lines, words and characters.