I have a very large text file that has over 11 million entries/lines. Each line has 35 values in it, each value is separated/delimited by a "|".
For each line that I read in, I create an object, "Record". I store them in a vector of Records because I need to be able to sort them based on the values in a given field. (Please suggest a better approach if there is one.)
I know how to overload the istream >> operator, but I have never had to do it for an object this large, and I'm not sure what the best approach is. I tried extracting a token before each delimiter, e.g.:
using namespace std;
inline istream& operator>>(istream& is, Record& r) {
string line_of_text;
string token;
char delim = '|';
getline(is, line_of_text); // read one whole record (one line) from the stream
token = line_of_text.substr(0, line_of_text.find(delim));
r.firstField = token;
// so on for each field in Record
return is;
}
but this is very impractical and inefficient.
Is there a reasonable way of doing this for such a large object? What is the best way to parse text like this without wasting so much memory?
Example line of input:
xx|0000|0| 0.00| 3.00|111|111| 5.70| 136000.00| 620.23| 80.00| 47.00| 0.000|FIX |P|C| 80.00|Full|SF|1.|P|convention|ME| 3| | |UnReported |WFHM |2 |N| |1|0|0|0|0|0| 126162.03| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00
I also tried just doing
inline istream& operator>>(istream& is, Record& r) {
return is >> r.fieldOne >> r.fieldTwo; //....etc
}
but this does not work, because many fields are separated only by a '|' rather than by a space. Is there a graceful way to have >> skip the "|" as it does whitespace? Keep in mind that fields may be empty.
I really wanted to find a use for pointer-to-member syntax for once, so...
You can use pointer-to-member syntax with a set of overloaded helpers to let the compiler choose the correct converter:
#include <istream>
#include <string>
struct Record
{
int x;
std::string y;
double z;
void readInput(std::istream& in, int Record::*var)
{
std::string input;
std::getline(in, input, '|');
this->*var = std::stoi(input);
}
void readInput(std::istream& in, double Record::*var)
{
std::string input;
std::getline(in, input, '|');
this->*var = std::stod(input);
}
void readInput(std::istream& in, std::string Record::*var)
{
std::getline(in, this->*var, '|');
}
};
With this, operator>> would look like this:
std::istream& operator>>(std::istream& in, Record& r)
{
r.readInput(in, &Record::x);
r.readInput(in, &Record::y);
r.readInput(in, &Record::z);
// no need to handle the last value as a special case, as long as the stream ends there
// and you don't mind the stream having eofbit set afterwards
return in;
}
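Because the input has one record per line, the last field of each record ends at a newline rather than at a '|'. One way to make "the stream ends there" hold for every record is to read each line with std::getline and parse it from a std::istringstream. The following is a minimal sketch under that assumption, using the three-field Record from above; `loadRecords` is a hypothetical helper name, and sorting by `x` is only an example of sorting on a chosen field.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Three-field stand-in for the real 35-field Record.
struct Record
{
    int x = 0;
    std::string y;
    double z = 0.0;

    void readInput(std::istream& in, int Record::*var)
    {
        std::string input;
        std::getline(in, input, '|');
        this->*var = std::stoi(input);
    }
    void readInput(std::istream& in, double Record::*var)
    {
        std::string input;
        std::getline(in, input, '|');
        this->*var = std::stod(input);
    }
    void readInput(std::istream& in, std::string Record::*var)
    {
        std::getline(in, this->*var, '|');
    }
};

std::istream& operator>>(std::istream& in, Record& r)
{
    r.readInput(in, &Record::x);
    r.readInput(in, &Record::y);
    r.readInput(in, &Record::z);
    return in;
}

// Hypothetical helper: reads one record per line, then sorts by field x.
std::vector<Record> loadRecords(std::istream& in)
{
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;                     // skip blank lines
        std::istringstream fields(line);  // this stream ends where the record ends
        Record r;
        fields >> r;
        records.push_back(std::move(r));
    }
    std::sort(records.begin(), records.end(),
              [](const Record& a, const Record& b) { return a.x < b.x; });
    return records;
}
```

With 11 million lines, calling `records.reserve(...)` with an estimated count before the loop would likely avoid repeated reallocation of the vector.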
It would also be possible to just provide free functions that take a reference instead of a pointer to member, e.g.:
void readInput(std::istream& in, int& var)
{
std::string input;
std::getline(in, input, '|');
var = std::stoi(input);
}
with usage in operator>> like this:
readInput(in, r.x);
The core difference between these two approaches is whether you want the helper to be usable only with Record, or whether you will always want to read ints delimited by | from istreams.
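The question mentions that fields may be empty, and std::stoi and std::stod throw std::invalid_argument when given an empty or whitespace-only string. Below is a rough sketch of the free-function variant that falls back to a default in that case. The helper `isBlank` and the choice of 0 as the default are assumptions of this sketch, not part of the approach above; a real parser might prefer std::optional or an error report.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: true if the field is empty or contains only whitespace.
inline bool isBlank(const std::string& s)
{
    return s.find_first_not_of(" \t\r") == std::string::npos;
}

// Reads up to the next '|'; a blank field yields 0 (assumed default).
void readInput(std::istream& in, int& var)
{
    std::string input;
    std::getline(in, input, '|');
    var = isBlank(input) ? 0 : std::stoi(input);
}

// Same idea for floating-point fields; a blank field yields 0.0.
void readInput(std::istream& in, double& var)
{
    std::string input;
    std::getline(in, input, '|');
    var = isBlank(input) ? 0.0 : std::stod(input);
}

// String fields are stored as-is, including any padding spaces.
void readInput(std::istream& in, std::string& var)
{
    std::getline(in, var, '|');
}
```

Leading spaces such as those in " 5.70" need no special handling here, because std::stoi and std::stod skip leading whitespace before converting.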