I'm looking to read from std::cin with a syntax as below (it is always int, int, int, char[]/str). What would be the fastest way to parse the data into an int array[3] and either a string or char array?
#NumberOfLines(i.e.10000000)
1,2,2,'abc'
2,2,2,'abcd'
1,2,3,'ab'
...1M+ to 10M+ more lines, always in the form of (int,int,int,str)
At the moment, I'm doing something along the lines of:
//unsync stdio
std::ios_base::sync_with_stdio(false);
std::cin.tie(nullptr);
//read from cin, one line at a time
std::string str, label;
int array[3];
while(std::getline(std::cin, str)) {
    //split off the three leading ints
    for(int i = 0; i < 3; ++i) {
        std::size_t commaindex = str.find(',');
        std::string substring = str.substr(0, commaindex);
        array[i] = atoi(substring.c_str());
        str.erase(0, commaindex + 1);
    }
    //whatever remains after the third comma is the string
    label = str;
    //assign array and label to other stuff and do other stuff, repeat
}
I'm quite new to C++ and recently learned profiling with Visual Studio, but I'm not the best at interpreting the results. IO takes up 68.2% and the kernel takes 15.8% of CPU usage. getline() covers 35.66% of the elapsed inclusive time.
Is there any way I can read large chunks at once to avoid calling getline() as much? I've been told fgets() is much faster; however, I'm unsure of how to use it when I cannot predict the number of characters to specify.
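From the docs, fgets() seems to only need a maximum buffer size rather than an exact character count (it stops after size - 1 characters or at the first newline), so I imagine the usage would be something like this sketch, with a guessed upper bound on the line length:
//rough sketch of how I understand fgets() (needs <cstdio>)
char buf[256]; //guessed upper bound on the line length
while(fgets(buf, sizeof buf, stdin)) {
    //buf holds one line, including the trailing '\n' if the line fit
    //...parse the three ints and the string out of buf here...
}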
I've attempted to use scanf as follows; however, it was slower than the getline method. I've also used stringstreams, but that was incredibly slow.
scanf("%i,%i,%i,%s",&array[0],&array[1],&array[2],str);
Also, if it matters, it is run on a server with low memory available. I think reading the entire input into a buffer would not be viable? Thanks!
Update: Using @ted-lyngmo's approach, I gathered the results below.
time wc datafile
real 4m53.506s
user 4m14.219s
sys 0m36.781s
time ./a.out < datafile
real 2m50.657s
user 1m55.469s
sys 0m54.422s
time ./a.out datafile
real 2m40.367s
user 1m53.523s
sys 0m53.234s
You could use std::from_chars (and reserve() the approximate number of lines you have in the file, if you store the values in a vector, for example). I also suggest adding support for reading directly from the file. Reading from a file opened by the program is (at least for me) faster than reading from std::cin (even with sync_with_stdio(false)).
Example:
#include <algorithm> // std::for_each
#include <cctype> // std::isspace
#include <charconv> // std::from_chars
#include <cstdint> // std::uintmax_t
#include <cstdio> // std::perror
#include <fstream>
#include <iostream>
#include <iterator> // std::istream_iterator
#include <limits> // std::numeric_limits
#include <string> // std::string
struct foo {
int a[3];
std::string s;
};
std::istream& operator>>(std::istream& is, foo& f) {
if(std::getline(is, f.s)) {
std::from_chars_result fcr{f.s.data(), {}};
const char* end = f.s.data() + f.s.size();
// extract the numbers
for(unsigned i = 0; i < 3 && fcr.ptr < end; ++i) {
fcr = std::from_chars(fcr.ptr, end, f.a[i]);
if(fcr.ec != std::errc{}) {
is.setstate(std::ios::failbit);
return is;
}
// find next non-whitespace
do ++fcr.ptr;
while(fcr.ptr < end &&
std::isspace(static_cast<unsigned char>(*fcr.ptr)));
}
// extract the string
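// (after the loop, fcr.ptr sits on the opening quote; the ++ below
// skips it, and end - 1 below drops the closing quote)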
if(++fcr.ptr < end)
f.s = std::string(fcr.ptr, end - 1);
else
is.setstate(std::ios::failbit);
}
return is;
}
std::ostream& operator<<(std::ostream& os, const foo& f) {
for(int i = 0; i < 3; ++i) {
os << f.a[i] << ',';
}
return os << '\'' << f.s << "'\n";
}
int main(int argc, char* argv[]) {
std::ifstream ifs;
if(argc >= 2) {
ifs.open(argv[1]); // if a filename is given as argument
if(!ifs) {
std::perror(argv[1]);
return 1;
}
} else {
std::ios_base::sync_with_stdio(false);
std::cin.tie(nullptr);
}
std::istream& is = argc >= 2 ? ifs : std::cin;
// ignore the first line - it's of no use in this demo
is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
// read all `foo`s from the stream
std::uintmax_t co = 0;
std::for_each(std::istream_iterator<foo>(is), std::istream_iterator<foo>(),
[&co](const foo& f) {
// Process each foo here
// Just counting them for demo purposes:
++co;
});
std::cout << co << '\n';
}
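The reserve() suggestion above isn't shown in the example. A minimal sketch of the idea, assuming the records are collected in a std::vector<foo> and that the first line contains only the line count (the container and variable names are illustrative):
std::vector<foo> rows; // needs <vector>
std::uintmax_t expected = 0;
if(is >> expected) // assumes the first line holds just the count
    rows.reserve(expected); // pre-allocate to avoid repeated reallocations
is.ignore(std::numeric_limits<std::streamsize>::max(), '\n'); // skip the rest of that line
// ...then push_back each successfully read foo in the loop...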
My test runs on a file with 1'000'000'000 lines with content looking like below:
2,2,2,'abcd'
2, 2,2,'abcd'
2, 2, 2,'abcd'
2, 2, 2, 'abcd'
time wc datafile
1000000000 2500000000 14500000000 datafile
real 1m53.440s
user 1m48.001s
sys 0m3.215s
time ./my_from_chars_prog datafile
1000000000
real 1m43.471s
user 1m28.247s
sys 0m5.622s
From this comparison I think one can see that my_from_chars_prog is able to successfully parse all entries pretty fast. It was consistently faster at doing so than wc, a standard Unix tool whose only purpose is to count lines, words and characters.