I want to process a large text file line by line. I have found some code that looks very fast at reading a file:
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<std::byte> load_file(std::string const& filepath)
{
    std::ifstream ifs(filepath, std::ios::binary | std::ios::ate);
    if (!ifs)
        throw std::runtime_error(filepath + ": " + std::strerror(errno));
    auto end = ifs.tellg();
    ifs.seekg(0, std::ios::beg);
    auto size = std::size_t(end - ifs.tellg());
    if (size == 0) // avoid undefined behavior
        return {};
    std::vector<std::byte> buffer(size);
    if (!ifs.read(reinterpret_cast<char*>(buffer.data()), buffer.size()))
        throw std::runtime_error(filepath + ": " + std::strerror(errno));
    return buffer;
}
Now the problem is that I do not know how to use this to read the lines of the file.
This is the solution I came up with, but somehow it looks very bad and inefficient to me. Is there a better way to do this, or is the load_file function simply not suited for reading lines from a text file?
auto fileContent = load_file(R"(C:\analysis\simple.txt)");
auto line = std::vector<std::byte>();
for (const auto& byte : fileContent) {
    if (static_cast<char>(byte) != '\n') {
        line.push_back(byte);
    } else {
        std::cout
            << std::string_view(reinterpret_cast<char*>(line.data()), line.size())
            << std::endl;
        line.clear();
    }
}
Whenever the term "fast" comes up, you will often read comments like the following:
So, first, please make sure that you compile your program with all speed optimizations turned on. Then, please understand that 1'000'000 lines are considered small nowadays.
Regarding the shown source code:
Your first code example simply reads the whole file into a std::vector of std::byte. It does this by determining the file size via seek/tell and then using the very fast read function of the std::ifstream.
This will work and will be fast, but will not help you, because you need lines.
The second code snippet analyzes the std::vector that was read before and prints a line whenever a '\n' has been found. This is basically OK. But the std::string_views are not stored anywhere. Maybe this solution is sufficient for you; if you need the lines afterwards, collect the views instead, as sketched below.
Anyway, here are some comments:
- Use the file_size function from <filesystem> instead of seeking to the end of the stream and calling tellg (see the sketch after this list).
- Use the pubsetbuf function of the std::ifstream's streambuf to install a bigger input buffer.
- Using non-optimized stream functions like std::getline or std::istringstream will not help. They will be much, much slower.
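To illustrate the first two points, load_file could be reworked like this. This is only a sketch of the idea: the 1 MB buffer size is an arbitrary value chosen for illustration, and on some standard library implementations pubsetbuf must be called before the file is opened, which is why the stream is opened in a separate step here:

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<std::byte> load_file(std::string const& filepath)
{
    static char ioBuffer[1'000'000];                   // arbitrary 1 MB stream buffer
    // file_size replaces the seek/tell dance. It throws
    // std::filesystem::filesystem_error if the file does not exist.
    auto const size = std::filesystem::file_size(filepath);
    std::ifstream ifs;
    ifs.rdbuf()->pubsetbuf(ioBuffer, sizeof ioBuffer); // install before opening
    ifs.open(filepath, std::ios::binary);
    if (!ifs)
        throw std::runtime_error(filepath + ": " + std::strerror(errno));
    if (size == 0)
        return {};
    std::vector<std::byte> buffer(size);
    if (!ifs.read(reinterpret_cast<char*>(buffer.data()), buffer.size()))
        throw std::runtime_error(filepath + ": " + std::strerror(errno));
    return buffer;
}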
To show you what can be achieved, I created a test file with 50'000'000 lines. The resulting file size was 1.5 GB in my test.
Please see the example code below:
#include <chrono>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <string_view>
#include <vector>

struct Timer {
    std::chrono::time_point<std::chrono::high_resolution_clock> startTime{};
    long long elapsedTime{};
    void start() { startTime = std::chrono::high_resolution_clock::now(); }
    void stop() { elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count(); }
    friend std::ostream& operator << (std::ostream& os, const Timer& t) { return os << t.elapsedTime << " ms "; }
};

constexpr size_t NumberOfRows = 50'000'000U;
constexpr size_t NumberOfRowsGuess = 60'000'000U;
constexpr int MinLineLength = 10;
constexpr int MaxLineLength = 30;
const std::string testDataFileName{ "r:\\test.txt" };

void createTestFile() {
    static std::random_device rd{};
    static std::mt19937 gen{ rd() };
    std::uniform_int_distribution<unsigned int> uniformDistributionStringLength(MinLineLength, MaxLineLength);

    if (std::ofstream testDataStream(testDataFileName); testDataStream) {
        Timer t1; t1.start();
        for (size_t row{}; row < NumberOfRows; ++row) {
            testDataStream << row << ' ' << std::string(uniformDistributionStringLength(gen), 'a') << '\n';
        }
        t1.stop(); std::cout << "\nDuration for test file creation: " << t1 << '\n';
    }
    else std::cerr << "\nError: Could not open file '" << testDataFileName << "' for writing.\n\n";
}

constexpr std::size_t IOBufSize = 5'000'000u;
static char ioBuf[IOBufSize];

int main() {
    //createTestFile();
    if (std::ifstream ifs{ testDataFileName, std::ios::binary }; ifs) {
        Timer tOverall{}; tOverall.start();

        // To speed up reading of the file, we will set a bigger input buffer
        // (note: some implementations require this before the file is opened)
        ifs.rdbuf()->pubsetbuf(ioBuf, IOBufSize);

        // Here we will store the complete file, all data
        std::string text{};

        // Get number of bytes in the file
        const std::uintmax_t size = std::filesystem::file_size(testDataFileName);
        text.resize(size);

        // Read the whole file with one statement. Will be ultrafast
        Timer t; t.start();
        ifs.read(text.data(), size);
        t.stop(); std::cout << "Duration for reading complete file:\t\t" << t << "\t\tData read: " << ifs.gcount() << " bytes\n";

        // Create a vector for the string views and reserve memory. Make a big guess
        std::vector<std::string_view> lines{};
        lines.reserve(NumberOfRowsGuess);

        // Create the string views for the lines. Each view points into `text`
        // and includes the terminating '\n'
        char* start{ text.data() };
        char* end{ start };
        t.start();
        for (const char c : text) {
            ++end;
            if (c == '\n') {
                lines.emplace_back(start, static_cast<std::size_t>(end - start));
                start = end;
            }
        }
        std::cout << "\nNumber of lines Read: " << lines.size() << '\n';
        t.stop(); std::cout << "Duration for creating all string views:\t\t" << t << '\n';

        tOverall.stop(); std::cout << "\n\nDuration overall:\t\t\t\t" << tOverall << '\n';
    }
    else std::cout << "\n\nError: Could not open test file '" << testDataFileName << "'\n\n";
}
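One caveat about this approach: the std::string_views in lines do not own any characters; they all point into text. The text string must therefore stay alive (and must not be resized) for as long as the views are used.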
I tested the program on my 12-year-old Windows 7 machine.
Program output was:
Duration for reading complete file: 752 ms Data read: 1538880087 bytes
Number of lines Read: 50000000
Duration for creating all string views: 1769 ms
Duration overall: 2966 ms
This should be "sufficiently" fast.