I have over 100,000 csv files in the below format:
1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,
All I need is field 10 and field 17 onward, field 10 is the counter indicate how many integer stored start from field 17 i.e. what I need is:
6,13,4,7,8,18,20
5,4,7,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
4,4,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
12,13,4,7,8,18,20,21,25,27,29,31,32
Max number of integer need to read is 28. I can easily achieve this by Getline in C++, however, from my previous experience, since I need to handle over 100,000 such files and each files may have 300,000~400,000 such lines. Therefore using Getline to read in the data and build a vector> may have serious performance issue for me. I tried to use fscanf to achieve this:
while (!feof(stream)){
fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d",&MyCounter);
fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d"); // skip to column 17
for (int i=0;i<MyCounter;i++){
fscanf(fstream,"%d",&MyIntArr[i]);
}
fscanf(fstream,"%*s"); // to finish the line
}
However, this will call fscanf multiple times and may also create performance issue. Is there any way to read in variable number of integer at 1 call with fscanf ? Or I need to read into a string and then strsep/stoi it ? Compare to fscanf, which is better from performance point of view?
In order to maximize performance, you should map the files in memory with mmap
or equivalent and parse the file with ad hoc code, typically scanning one character at a time with a pointer, checking for '\n'
and/or '\r'
for end of record and converting the numbers on the fly for storage to your arrays. The tricky parts are:
mmap
call. The advantage is you only need check for end of file when you encounter a newline sequence.