
How to convert vector of bytes into lines and store as string


I want to process a large text file line by line. I have found some code that looks to be very fast when reading a file:

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<std::byte> load_file(std::string const& filepath)
{
  std::ifstream ifs(filepath, std::ios::binary|std::ios::ate);

  if(!ifs)
    throw std::runtime_error(filepath + ": " + std::strerror(errno));

  auto end = ifs.tellg();
  ifs.seekg(0, std::ios::beg);

  auto size = std::size_t(end - ifs.tellg());

  if(size == 0) // avoid undefined behavior
    return {};

  std::vector<std::byte> buffer(size);

  if(!ifs.read((char*)buffer.data(), buffer.size()))
    throw std::runtime_error(filepath + ": " + std::strerror(errno));

  return buffer;
}

Now the problem is, I do not know how to use this to read lines of the file.

This is a solution I came up with, but it looks quite bad and inefficient to me! Is there a better way to do this, or is the load_file function simply not suited for reading lines from a text file?

  auto fileContent = load_file(R"(C:\analysis\simple.txt)");

  auto line = std::vector<std::byte>();
  for(const auto& byte : fileContent) {
    if(static_cast<char>(byte) != '\n') {
      line.push_back(byte);
    } else {
      std::cout
          << std::string_view(reinterpret_cast<char*>(line.data()), line.size())
          << std::endl;
      line.clear();
    }
  }

Solution

  • Whenever the term fast comes up, you will often read comments like:

    • Did you compile in release mode with all optimizations on?
    • What is a big file, and what is a large number of lines?

    So, first, please make sure that you compile your program with all speed optimizations turned on. Then, please understand that 1'000'000 lines is considered small nowadays.


    Regarding the shown source code:

    Your first code example simply reads the file into a std::vector of std::bytes. It does this by determining the file size and then using the very fast read function of std::ifstream.

    This will work and will be fast, but will not help you, because you need lines.

    The second code snippet analyzes the std::vector that was read before and prints a line each time a '\n' is found. This is basically OK, but the std::string_views are not stored. Maybe this solution is sufficient for you; if not, you can collect the views in a std::vector, as shown in the sketch below.
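
    A minimal sketch of how your loop could store the lines instead of printing them might look like this. The helper name split_lines is just for illustration; note that the views point into the byte buffer, so fileContent must outlive them:

    #include <cstddef>
    #include <string_view>
    #include <vector>

    // Collect one std::string_view per line; the views point into fileContent,
    // so the buffer must stay alive as long as the views are used.
    std::vector<std::string_view> split_lines(const std::vector<std::byte>& fileContent)
    {
        std::vector<std::string_view> lines;
        const char* lineStart = reinterpret_cast<const char*>(fileContent.data());
        const char* const bufferEnd = lineStart + fileContent.size();

        for (const char* p = lineStart; p != bufferEnd; ++p) {
            if (*p == '\n') {
                lines.emplace_back(lineStart, static_cast<std::size_t>(p - lineStart));
                lineStart = p + 1;
            }
        }
        // Keep a possible last line that has no trailing '\n'
        if (lineStart != bufferEnd)
            lines.emplace_back(lineStart, static_cast<std::size_t>(bufferEnd - lineStart));

        return lines;
    }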


    Anyway, here are some comments:

    • Your method of detecting the file size is not safe. Please use the file_size function from <filesystem> instead.
    • You can speed up the read operation by providing a bigger input buffer via the pubsetbuf function of the std::ifstream's stream buffer. A sketch applying both points follows below.
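
    As a rough sketch, this is how both points could be applied to your load_file function. The 1 MB buffer size is an arbitrary choice, and the effect of pubsetbuf is implementation-defined, so it may vary between standard libraries:

    #include <cerrno>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <filesystem>
    #include <fstream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    std::vector<std::byte> load_file(std::string const& filepath)
    {
        // Ask the filesystem for the size instead of seeking to the end of the stream
        const std::uintmax_t size = std::filesystem::file_size(filepath);

        std::ifstream ifs;

        // Install a bigger stream buffer before the file is opened and read.
        // The effect of pubsetbuf is implementation-defined; 1 MB is an arbitrary choice.
        static char ioBuf[1'000'000];
        ifs.rdbuf()->pubsetbuf(ioBuf, sizeof(ioBuf));

        ifs.open(filepath, std::ios::binary);
        if (!ifs)
            throw std::runtime_error(filepath + ": " + std::strerror(errno));

        if (size == 0)
            return {};

        std::vector<std::byte> buffer(size);

        if (!ifs.read(reinterpret_cast<char*>(buffer.data()), static_cast<std::streamsize>(size)))
            throw std::runtime_error(filepath + ": " + std::strerror(errno));

        return buffer;
    }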

    Using unoptimized stream functions like std::getline or std::istringstream will not help; they will be much slower. For comparison, a getline-based sketch is shown below.
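
    This is roughly what the conventional std::getline approach looks like (the function name readLinesWithGetline is just for illustration). It is short and convenient, but it copies every line into its own std::string and goes through the stream layer line by line, which is why it tends to be much slower for files of this size:

    #include <fstream>
    #include <string>
    #include <vector>

    // For comparison only: conventional line-by-line reading with std::getline.
    // Every line is copied into its own std::string.
    std::vector<std::string> readLinesWithGetline(const std::string& filepath)
    {
        std::ifstream ifs(filepath);
        std::vector<std::string> lines;
        for (std::string line; std::getline(ifs, line);)
            lines.push_back(std::move(line));
        return lines;
    }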

    To show you what can be achieved, I created a test file with 50'000'000 lines. The resulting file was 1.5 GB in my test.

    Please see the example code below:

    #include <iostream>
    #include <fstream>
    #include <chrono>
    #include <cstdint>
    #include <filesystem>
    #include <random>
    #include <string>
    #include <string_view>
    #include <vector>
    
    struct Timer {
        std::chrono::time_point<std::chrono::high_resolution_clock> startTime{};
        long long elapsedTime{};
        void start() { startTime = std::chrono::high_resolution_clock::now(); }
        void stop() { elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime).count(); }
        friend std::ostream& operator << (std::ostream& os, const Timer& t) { return os << t.elapsedTime << " ms "; }
    };
    
    constexpr size_t NumberOfRows =      50'000'000U;
    constexpr size_t NumberOfRowsGuess = 60'000'000U;
    constexpr int MinLineLength = 10;
    constexpr int MaxLineLength = 30;
    
    const std::string testDataFileName{ "r:\\test.txt" };
    
    void createTestFile() {
        static std::random_device rd{};
        static std::mt19937 gen{ rd() };
        std::uniform_int_distribution<unsigned int> uniformDistributionStringLength(MinLineLength, MaxLineLength);
    
        if (std::ofstream testDataStream(testDataFileName); testDataStream) {
            Timer t1; t1.start();
            for (size_t row{}; row < NumberOfRows; ++row) {
                testDataStream << row << ' ' << std::string(uniformDistributionStringLength(gen), 'a') << '\n';
            }
            t1.stop(); std::cout << "\nDuration for test file creation: " << t1 << '\n';
        }
        else std::cerr << "\nError: Could not open file '" << testDataFileName << "' for writing.\n\n";
    }
    
    
    constexpr std::size_t IOBufSize = 5'000'000u;
    static char ioBuf[IOBufSize];
    
    
    int main() {
        //createTestFile();
    
        if (std::ifstream ifs{ testDataFileName,std::ios::binary }; ifs) {
    
            Timer tOverall{}; tOverall.start();
    
            // To speed up reading of the file, we will set a bigger input buffer
            ifs.rdbuf()->pubsetbuf(ioBuf, IOBufSize);
    
            // Here we will store the complete file, all data
            std::string text{};
    
            // Get number of bytes in file
            const std::uintmax_t size = std::filesystem::file_size(testDataFileName);
            text.resize(size);
    
            // Read the whole file with one statement. Will be ultrafast
            Timer t; t.start();
            ifs.read(text.data(), size);
            t.stop(); std::cout << "Duration for reading complete file:\t\t" << t << "\t\tData read: " << ifs.gcount() << " bytes\n";
    
            // Creating a vector with string views and reserve memory. Make a big guess
            std::vector<std::string_view> lines{}; 
            lines.reserve(NumberOfRowsGuess);
    
            // Create the string views for the lines (each view includes its trailing '\n')
            char* start{ text.data() };
            char* end{ start };
    
            t.start();
            for (const char c : text) {
                ++end;
                if (c == '\n') {
                    lines.push_back({start, end });
                    start = end;
                }
            }
            std::cout << "\nNumber of lines Read: " << lines.size() << '\n';
    
            t.stop(); std::cout << "Duration for creating all string views:\t\t" << t << '\n';
            tOverall.stop(); std::cout << "\n\nDuration overall:\t\t\t\t" << tOverall << '\n';
        }
        else std::cout << "\n\nError: Could not open test file '" << testDataFileName << "'\n\n";
    }
    

    I tested the program on my 12-year-old Windows 7 machine.

    Program output was:

    Duration for reading complete file:             752 ms          Data read: 1538880087 bytes
    
    Number of lines Read: 50000000
    Duration for creating all string views:         1769 ms
    
    
    Duration overall:                               2966 ms
    
    

    This should be "sufficiently" fast.