I am writing a program which reads large (10 GB+) text files, structured in chunks, like this:
@Some_header
ATCCTTTATTCGGTATCGGATATATTACGCGCGGGGGATATCGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:::::::::
@Some_header unfixable_error
ATTTATTTAGAGGAGACTTTTATTTACCCCCCCCGGGGGGATTTTA
+
FFFFFFF:::::::::::::::FFFFFFFFFFUUUUUUUFFUUFUU
@Some_header
ATTATTCCCCTTTTTATACCGGGGGGAAATTAGGGGGGGCCCCTTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
A chunk consists of the @header, the ATCG sequence, the '+', and then another string with the same length as the ATCG sequence. Some @header lines have 'unfixable_error' just before the newline. My program must read through these files and write all chunks, except for those with a @header unfixable_error, to a new file.
Currently, my approach is to utilize 'getline()', like so:
std::ifstream inFile(inFileStr);
std::ofstream outFile(outFileStr);
std::string currLine;
while (getline(inFile, currLine)) {
    if (currLine == "+" || currLine.substr(currLine.length()-5, 5) != "error") {
        outFile << currLine << std::endl;
    }
    else {
        for (int i = 0; i < 3; i++) {
            getline(inFile, currLine);
        }
    }
}
inFile.close();
outFile.close();
I'm certain there's a better solution to this, however. What is the fastest feasible way to accomplish this?
Here are a few points:
- substr creates a new string, which is quite expensive for a simple comparison. Since C++17, you can use string views to avoid creating new strings. An alternative is to use compare with a position and size. Since C++20, there is also ends_with, which is simpler here.
- std::endl flushes the output, which is inefficient. Please consider just using '\n' instead.
- getline tends to be a bit slow in practice. You can read big chunks and parse them yourself while avoiding copies as much as possible. Writing in chunks is more efficient too. The chunks should not be too big, so that they fit in the CPU caches (RAM is slow compared to caches). For example, skipping lines with getline is not efficient since it copies data in memory. With chunks, you can directly search for the next three \n without any write. This operation can easily be vectorized using SIMD instructions, so it can be very fast (compilers should be able to do that for you).
- currLine might result in a small speed up.
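To make the chunk-based idea concrete, here is a minimal sketch rather than a tuned implementation. It assumes C++17, LF line endings, and exactly four lines per record. The names has_suffix, filter_records and filter_stream are hypothetical, and the 1 MiB default chunk size is a guess that should be measured on the target machine. It relies on std::string_view::find to locate newlines; whether that search ends up vectorized depends on the standard library and compiler.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>      // only needed to try the sketch on in-memory streams
#include <string>
#include <string_view>

// Hypothetical helper: true if `line` ends with `suffix`; no new string is created.
bool has_suffix(std::string_view line, std::string_view suffix) {
    return line.size() >= suffix.size() &&
           line.compare(line.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Writes every complete 4-line record in `data` to `out`, except records whose
// header ends with "unfixable_error". Returns the number of bytes consumed;
// anything after that is an incomplete record that needs more input.
std::size_t filter_records(std::string_view data, std::ostream& out) {
    std::size_t pos = 0;
    for (;;) {
        std::size_t end = pos;
        std::size_t header_end = 0;
        for (int line = 0; line < 4; ++line) {
            const std::size_t nl = data.find('\n', end);
            if (nl == std::string_view::npos) return pos;  // record not complete yet
            if (line == 0) header_end = nl;
            end = nl + 1;
        }
        // string_view::substr only adjusts a pointer and a length, nothing is copied.
        const std::string_view header = data.substr(pos, header_end - pos);
        if (!has_suffix(header, "unfixable_error"))
            out.write(data.data() + pos, static_cast<std::streamsize>(end - pos));
        pos = end;
    }
}

// Reads `in` in blocks of roughly `chunk_size` bytes and filters them into `out`.
void filter_stream(std::istream& in, std::ostream& out,
                   std::size_t chunk_size = std::size_t{1} << 20) {
    std::string buf(chunk_size > 0 ? chunk_size : 1, '\0');
    std::size_t filled = 0;  // number of valid bytes at the start of buf
    while (in) {
        // A full buffer without a single complete record: grow it.
        if (filled == buf.size()) buf.resize(buf.size() * 2);
        in.read(buf.data() + filled, static_cast<std::streamsize>(buf.size() - filled));
        filled += static_cast<std::size_t>(in.gcount());
        const std::size_t used = filter_records(std::string_view(buf.data(), filled), out);
        assert(used <= filled);
        // Move the incomplete tail to the front so the next read completes it.
        std::memmove(buf.data(), buf.data() + used, filled - used);
        filled -= used;
    }
    // Leftover at end of file, e.g. a last record without a trailing '\n'.
    // It is written as-is unless its header marks it as an error.
    if (filled > 0) {
        const std::string_view tail(buf.data(), filled);
        const std::string_view header = tail.substr(0, tail.find('\n'));
        if (!has_suffix(header, "unfixable_error"))
            out.write(tail.data(), static_cast<std::streamsize>(tail.size()));
    }
}
```

With files, this would be called as filter_stream(inFile, outFile) on streams opened in binary mode. Skipping a record only advances an index here, and kept records are written with '\n' already in place, so nothing is flushed per line.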