c++ csv io

Increasing the speed of reading a CSV file in C++


I wrote this code to read and filter my CSV files. It works as intended for small files, but a file with 200k lines takes around 4 minutes, which is too long for my use case.

After some testing and fixing a few silly mistakes, I got the time down to about 3 minutes. About half of the time is spent reading the file and the other half is spent generating the result vector.

Is there any way to improve the speed of my program, especially the part that reads the CSV? I don't have any ideas at the moment and would appreciate any help.

EDIT: The filter selects rows either by a time frame alone or by a time frame plus a filter word in specific columns, and writes the matching data into a result vector of strings.

My CSV files look like this:

Headers are:

ID;Timestamp;ObjectID;UserID;Area;Description;Comment;Checksum

Data is:

523;19.05.2021 12:15;####;admin;global;Parameter changed to xxx; Comment;x3J2j4

std::ifstream input_file(strComplPath, std::ios::in);

int counter = 0;
while (std::getline(input_file, record))
{
    istringstream line(record);
    while (std::getline(line, record, delimiter))
    {
        record.erase(remove(record.begin(), record.end(), '\"'), record.end());
        items.push_back(record);
        //cout << record;
    }

    csv_contents[counter] = items;
    items.clear();
    ++counter;
}
 

for (int i = 0; i < csv_contents.size(); i++) {
    string regexline = csv_contents[i][1];
    string endtime = time_upper_bound;
    string starttime = time_lower_bound;
    bool checkline = false;
    bool isInRange = false, isLater = false, isEarlier = false;

    // Check for faulty Data and replace it with an empty string 
    for (int oo = 0; oo < 8; oo++) {
        if (csv_contents[i][oo].rfind("#", 0) == 0) {
            csv_contents[i][oo] = "";
        }
    }

    if ((regex_search(starttime, m, timestampformat) && regex_search(endtime, m, timestampformat))) {
        filtertimeboth = true;
    }
    else if (regex_search(starttime, m, timestampformat)) {
        filterfromstart = true;
    }
    else if (regex_search(endtime, m, timestampformat)) {
        filtertoend = true;
    }
}

Solution

  • I'm not sure exactly what the bottleneck is in your program (I copied your code from an earlier version of the question), but you use a lot of regexes and mix reading records with post-processing. I suggest that you create a class to hold one of these records, called record, overload operator>> for record, and then use std::copy_if on the file with a filter that you can design separately from the reading. Do the post-processing after you've read the records that pass the filter.

    I made a small test and it takes 2 seconds to read 200k records on my old spinning disk while doing filtering. I only used time_lower_bound and time_upper_bound to filter and additional checks will of course make it a little slower, but it should not take minutes.
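On the regexes specifically: in the posted loop, starttime and endtime do not change between rows, yet they are matched against timestampformat for every row. Whether that is the main cost is not something I measured, but the check can be done once before the loop. A minimal sketch follows; bound_flags and classify_bounds are hypothetical names, and the pattern is an assumption because the question does not show its timestampformat.

```cpp
#include <regex>
#include <string>

// Hypothetical result type mirroring the three flags set in the question's loop.
struct bound_flags {
    bool filtertimeboth;
    bool filterfromstart;
    bool filtertoend;
};

// Decide the filter mode once, before iterating over any rows.
// The "dd.mm.yyyy hh:mm" pattern is assumed, not taken from the question.
bound_flags classify_bounds(const std::string& starttime,
                            const std::string& endtime) {
    // Constructing a std::regex is expensive, so build it only once.
    static const std::regex timestampformat(R"(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2})");

    const bool has_start = std::regex_search(starttime, timestampformat);
    const bool has_end = std::regex_search(endtime, timestampformat);

    bound_flags f;
    f.filtertimeboth = has_start && has_end;
    f.filterfromstart = has_start && !has_end;
    f.filtertoend = !has_start && has_end;
    return f;
}
```

Calling classify_bounds(time_lower_bound, time_upper_bound) once ahead of the row loop leaves only cheap flag checks inside it.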

    Example:

    #include <algorithm>
    #include <chrono>
    #include <ctime>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>
    
    // the suggested class to hold a record
    struct record {
        int ID;
        std::chrono::system_clock::time_point Timestamp;
        std::string ObjectID;
        std::string UserID;
        std::string Area;
        std::string Description;
        std::string Comment;
        std::string Checksum;
    };
    
    // A free function to read a time_point from an `istream`:
    std::chrono::system_clock::time_point to_tp(std::istream& is, const char* fmt) {
        std::chrono::system_clock::time_point tp{};
        // C++20:
        // std::chrono::from_stream(is, tp, fmt, nullptr, nullptr);
    
        // C++11 to C++17 version:
        std::tm tmtp{};
        tmtp.tm_isdst = -1;
        if(is >> std::get_time(&tmtp, fmt)) {
            tp = std::chrono::system_clock::from_time_t(std::mktime(&tmtp));
        }
        return tp;
    }
    
    // The operator>> overload to read one `record` from an `istream`:
    std::istream& operator>>(std::istream& is, record& r) {
        is >> r.ID;
        r.Timestamp = to_tp(is, ";%d.%m.%Y %H:%M;"); // using the helper function above
        std::getline(is, r.ObjectID, ';');
        std::getline(is, r.UserID, ';');
        std::getline(is, r.Area, ';');
        std::getline(is, r.Description, ';');
        std::getline(is, r.Comment, ';');
        std::getline(is, r.Checksum);
        return is;
    }
    
    // An operator<< overload to print one `record`:
    std::ostream& operator<<(std::ostream& os, const record& r) {
        std::ostringstream oss;
        oss << r.ID;
        { // I only made a C++11 to C++17 version for this one.
          // std::put_time zero-pads each field, so 09:05 is not printed as 9:5.
            std::time_t time = std::chrono::system_clock::to_time_t(r.Timestamp);
            std::tm ts = *std::localtime(&time);
            oss << std::put_time(&ts, ";%d.%m.%Y %H:%M;");
        }
        oss << r.ObjectID << ';' << r.UserID << ';' << r.Area << ';'
            << r.Description << ';' << r.Comment << ';' << r.Checksum << '\n';
        return os << oss.str();
    }
    
    // The reading and filtering part of `main` would then look like this:
    int main() { // not "void main()"
        std::istringstream time_lower_bound_s("20.05.2019 16:40:00");
        std::istringstream time_upper_bound_s("20.05.2021 09:40:00");
    
        // Your time boundaries as `std::chrono::system_clock::time_point`s - 
        // again using the `to_tp` helper function:
        auto time_lower_bound = to_tp(time_lower_bound_s, "%d.%m.%Y %H:%M:%S");
        auto time_upper_bound = to_tp(time_upper_bound_s, "%d.%m.%Y %H:%M:%S");
    
        // Verify that the boundaries were parsed ok:
        if(time_lower_bound == std::chrono::system_clock::time_point{} ||
           time_upper_bound == std::chrono::system_clock::time_point{}) {
            std::cerr << "failed to parse boundaries\n";
            return 1;
        }
    
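        std::ifstream is("data"); // whatever your file is called
        if(is) {
            // Skip the header line shown in the question; reading an int from
            // "ID;Timestamp;..." would fail and stop the copy immediately.
            // Remove these two lines if your file has no header.
            std::string header;
            std::getline(is, header);
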
            std::vector<record> recs; // a vector with all the records
    
            // create your filter
            auto filter = [&time_lower_bound, &time_upper_bound](const record& r) {
                // Only copy those `record`s within the set boundaries.
                // You can add additional conditions here too.
                return r.Timestamp >= time_lower_bound &&
                       r.Timestamp <= time_upper_bound;
            };
    
            // Copy those records that pass the filter:
            std::copy_if(std::istream_iterator<record>(is),
                         std::istream_iterator<record>{}, std::back_inserter(recs),
                         filter);
    
            // .. post process `recs` here ...
    
            // print result
            for(auto& r : recs) std::cout << r;
        }
    }
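
The question's edit also mentions filtering by a filter word in specific columns and blanking out faulty fields that start with '#'. One possible way to express both on top of the record type is sketched below. The names record_filter and blank_faulty_fields are hypothetical, and the choice of Description and Comment as the searched columns is an assumption, since the question does not say which columns are meant. The record struct is repeated only so that the sketch compiles on its own.

```cpp
#include <chrono>
#include <initializer_list>
#include <string>

// Same fields as the `record` struct in the example above.
struct record {
    int ID;
    std::chrono::system_clock::time_point Timestamp;
    std::string ObjectID;
    std::string UserID;
    std::string Area;
    std::string Description;
    std::string Comment;
    std::string Checksum;
};

// Hypothetical filter: a time window plus an optional keyword.
struct record_filter {
    std::chrono::system_clock::time_point lower;
    std::chrono::system_clock::time_point upper;
    std::string keyword; // empty means "filter by time only"

    bool operator()(const record& r) const {
        // Cheap comparison first, so most rows never reach the string search.
        if(r.Timestamp < lower || r.Timestamp > upper) return false;
        if(keyword.empty()) return true;
        // Assumed columns; adjust to the ones you actually want to search.
        return r.Description.find(keyword) != std::string::npos ||
               r.Comment.find(keyword) != std::string::npos;
    }
};

// Post-processing step from the question: clear fields that start with '#'.
void blank_faulty_fields(record& r) {
    for(std::string* field : {&r.ObjectID, &r.UserID, &r.Area,
                              &r.Description, &r.Comment, &r.Checksum}) {
        if(!field->empty() && field->front() == '#') field->clear();
    }
}
```

A record_filter{time_lower_bound, time_upper_bound, "changed"} can be passed to std::copy_if in place of the lambda, and blank_faulty_fields can then be applied to each element of recs, so only records that passed the filter are touched. Plain substring search with std::string::find avoids a regex per row.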