c++, performance, memory-management, large-data, istream

What is the best way to overload the istream >> operator for a large object (over 30 fields)?


I have a very large text file that has over 11 million entries/lines. Each line has 35 values in it, and the values are delimited by a "|".

For each line that I am reading in, I am creating an object, "Record". I am storing them in a vector of Records because I need to be able to sort them based on the values in a given field. (Please suggest a better approach if there is one.)

I know how to overload the istream >> operator, but I have never had to do it for an object this large, and I'm not sure what the best approach is. I tried to extract the token before each delimiter, i.e.:

using namespace std; 

inline istream& operator>>(istream& is, Record& r) {
    string line_of_text;
    string token;
    char delim = '|';

    getline(is, line_of_text); // read one whole record (line) into the buffer

    token = line_of_text.substr(0, line_of_text.find(delim));
    r.firstField = token;
    
    // so on for each field in Record

    return is;
}

but this is very impractical and inefficient.

Is there a reasonable way of doing this for such a large object? What is the best way to parse text like this without wasting so much memory?

Example line of input:

xx|0000|0| 0.00| 3.00|111|111| 5.70| 136000.00| 620.23| 80.00| 47.00| 0.000|FIX |P|C| 80.00|Full|SF|1.|P|convention|ME| 3| | |UnReported |WFHM |2 |N| |1|0|0|0|0|0| 126162.03| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00

I also tried just doing

inline istream& operator>>(istream& is, Record& r) {
    return is >> r.fieldOne >> r.fieldTwo; //....etc
}

but this does not work, because many fields are separated only by a '|' and not by a space. Is there a graceful way to have >> skip the "|" as it does with whitespace? Keep in mind that fields may be empty.


Solution

  • I really wanted to find a use for pointer-to-member syntax for once, so...

    You can use pointer-to-member syntax with a set of overloaded helpers to let the compiler choose the correct converter:

    struct Record
    {
        int x;
        std::string y;
        double z;
        
        void readInput(std::istream& in, int Record::*var)
        {
            std::string input;
            std::getline(in, input, '|');
            this->*var = std::stoi(input);
        }
        
        void readInput(std::istream& in, double Record::*var)
        {
            std::string input;
            std::getline(in, input, '|');
            this->*var = std::stod(input);
        }
        
        void readInput(std::istream& in, std::string Record::*var)
        {
            std::getline(in, this->*var, '|');
        }
    };
    

    With this, the operator >> would look like this:

    std::istream& operator>>(std::istream& in, Record& r)
    {
        r.readInput(in, &Record::x);
        r.readInput(in, &Record::y);
        r.readInput(in, &Record::z);
        // No need to handle the last value as a special case, as long as the
        // stream ends there and you don't care that it will be in fail() state afterwards.
        return in;
    }
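
When the input holds many records, one per line, the last field of each record ends at a newline rather than at a '|', so reading it with '|' as the delimiter would run on into the next line. One way around that, as a rough sketch, is to read a whole line first and parse the fields from a std::istringstream. The three-field Record is the same illustrative one as above; the line-buffering step is my assumption about how the real file is laid out, not something I have tested against it.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Illustrative three-field record, as in the example above.
struct Record
{
    int x = 0;
    std::string y;
    double z = 0.0;

    void readInput(std::istream& in, int Record::*var)
    {
        std::string input;
        std::getline(in, input, '|');
        this->*var = std::stoi(input);
    }

    void readInput(std::istream& in, double Record::*var)
    {
        std::string input;
        std::getline(in, input, '|');
        this->*var = std::stod(input);
    }

    void readInput(std::istream& in, std::string Record::*var)
    {
        std::getline(in, this->*var, '|');
    }
};

std::istream& operator>>(std::istream& in, Record& r)
{
    std::string line;
    if (!std::getline(in, line)) // one record per line; fails cleanly at end of input
        return in;

    std::istringstream fields(line); // the last field now ends where the line ends
    r.readInput(fields, &Record::x);
    r.readInput(fields, &Record::y);
    r.readInput(fields, &Record::z);
    return in; // the outer stream's state only reflects the line read
}
```

With this, a loop such as `while (in >> r) records.push_back(r);` should stop at the end of the file without leaving a half-read record behind. Note that std::stoi and std::stod still throw on a field that is not a number.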
    



    It would also be possible to provide free functions that take a reference instead of a pointer to member, e.g.:

    void readInput(std::istream& in, int& var)
    {
        std::string input;
        std::getline(in, input, '|');
        var = std::stoi(input);
    }
    

    with usage in operator >> like this:

    readInput(in, r.x);
    

    The core difference between these two approaches is whether you want the helper to be usable only with Record, or whether you want a general way to read '|'-delimited values from any istream.
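
Since the question mentions that fields can be empty (the sample line has fields like "| |"), it is worth noting that std::stoi and std::stod throw std::invalid_argument on an empty or whitespace-only string. Below is a minimal sketch of the free-function variant that tolerates such fields. The isBlank helper is a hypothetical name, and mapping a blank numeric field to 0 is an assumption; the real data may call for a different default or for std::optional.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: true when the field is empty or contains only whitespace.
inline bool isBlank(const std::string& s)
{
    return s.find_first_not_of(" \t\r") == std::string::npos;
}

// Each overload consumes one field, up to the next '|' or the end of the stream.
// Assumption: a blank numeric field is stored as 0 (or 0.0).
inline void readInput(std::istream& in, int& var)
{
    std::string input;
    std::getline(in, input, '|');
    var = isBlank(input) ? 0 : std::stoi(input);
}

inline void readInput(std::istream& in, double& var)
{
    std::string input;
    std::getline(in, input, '|');
    var = isBlank(input) ? 0.0 : std::stod(input);
}

// Strings are stored as-is, including any padding spaces.
inline void readInput(std::istream& in, std::string& var)
{
    std::getline(in, var, '|');
}
```

Leading spaces in numeric fields (" 80.00") need no special handling, because std::stoi and std::stod skip leading whitespace themselves. String fields keep their padding, so "FIX " stays "FIX " unless you trim it separately.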