I have data that looks like this:
token eps rank # first line names columns
Intercept 9.362637e+00 1 # later lines hold data
A1 -2.395553e-01 30
G1 -3.864725e-01 50
T1 1.565497e-01 43
....
Different files will have different numbers of named columns and the types of values in each column will vary among floats, ints, and strings.
I want to write a readCols function to which I send the names of columns (e.g. I may want the token and rank columns) and which will put the data in each specified column into a container of the appropriate type.
My problem is not in parsing the file but in returning a variable number of containers which contain different types. For instance, I want the token and rank columns put into vector<string> and vector<int> containers, respectively. The issue here is that I may want the eps column instead (stored in a vector), and I don't want to write a different readCols function for every conceivable combination of types. (The type of container doesn't matter to me. If I have to only use vectors, no problem; that each container contains a different type is the key.)
I'll probably need a container that holds different types to hold the different types of container. It looks like Boost.Variant might be the solution I want, but I don't know how to tell the parser which type I want each column to be (could I make something like a list of typenames? e.g. void readCols(string filename, vector<variant<various types of vector>> &data, vector<string> colNames, vector<typename> convertTo)). Likewise, Boost.Mpl.Vector may solve the problem, but again I can't quite figure out how to tell readCols how each column wants to be cast.
I can think of at least two workarounds:
1. Read one column per call into a container supplied by the caller (container::value_type allows the function to know how to parse). I don't prefer this solution because the files are occasionally large (millions of lines), so parsing them multiple times would take an extra few minutes (not a negligible percentage of run-time in programs whose calculation takes ~30 minutes; the program will run over and over).
2. Read every column as strings and convert afterwards with std::transform and boost::lexical_cast or something similar. If I can avoid 2n lines of bloat, great (n = number of columns, typically 2 or 3; 2 lines per column to declare the container and then transform).

It may be that the second workaround will require significantly less effort from me than a complete, generic solution; if that's the case, I'd like to know. I imagine that the second workaround might even be more efficient, but I'm mainly concerned with ease of use at the moment. If I can write one generic readCols function and be done with it, that's what I'd prefer.
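For concreteness, the per-column conversion in the second workaround might look roughly like the sketch below. The helper name convertColumn is hypothetical, and std::stoi / std::stod are used as stand-ins for boost::lexical_cast so that the example has no Boost dependency.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: converts a column that was read as strings into a
// vector<T>, using whatever conversion the caller supplies
// (boost::lexical_cast<T> would be one option; std::stoi is used in the
// usage note below).
template <typename T, typename Convert>
std::vector<T> convertColumn(const std::vector<std::string>& raw, Convert convert) {
    std::vector<T> out;
    out.reserve(raw.size());
    std::transform(raw.begin(), raw.end(), std::back_inserter(out), convert);
    assert(out.size() == raw.size());  // one converted value per input string
    return out;
}
```

With this, each column costs one line at the call site, e.g. `std::vector<int> rank = convertColumn<int>(rawRank, [](const std::string& s) { return std::stoi(s); });`, where rawRank is assumed to be the string column already read from the file.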
When things get too complicated, I break the problem into smaller parts. So here's a suggestion.
Write a CSV reader class which can read comma-separated (or otherwise delimited) values from a file. The class reads a line at a time and breaks the line into std::string fields. To access the fields, implement functions like getString, getInt, getDouble, etc. that look up a field (by column name or index) and convert it to the appropriate type. So the reader does one well-defined thing and deals with a limited number of primitive types.
Then implement reader functions (or classes) that use your CSV reader. These reader functions know the specific types of the columns and where to put their values: in scalars, containers, etc.
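A minimal sketch of that split, under some assumptions: the class name DelimitedReader, the struct TokenRank and the function readTokenRank are made up for illustration, the first line is taken to name the columns, fields are looked up by name only, and std::stoi / std::stod do the conversions. It reads from any std::istream, so a std::ifstream works for real files.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical reader: splits each line into string fields, converts on demand.
class DelimitedReader {
public:
    // Reads the header line right away; `delim` separates fields.
    explicit DelimitedReader(std::istream& in, char delim = ' ')
        : in_(in), delim_(delim) {
        std::string header;
        if (!std::getline(in_, header))
            throw std::runtime_error("missing header line");
        const std::vector<std::string> names = split(header);
        for (std::size_t i = 0; i < names.size(); ++i)
            index_[names[i]] = i;
    }

    // Advances to the next non-empty data line; returns false at end of input.
    bool next() {
        std::string line;
        while (std::getline(in_, line)) {
            if (line.empty()) continue;
            fields_ = split(line);
            return true;
        }
        return false;
    }

    // Field of the current line, looked up by column name.
    const std::string& getString(const std::string& col) const {
        auto it = index_.find(col);
        if (it == index_.end())
            throw std::out_of_range("unknown column: " + col);
        return fields_.at(it->second);  // throws if the line is too short
    }
    int getInt(const std::string& col) const { return std::stoi(getString(col)); }
    double getDouble(const std::string& col) const { return std::stod(getString(col)); }

private:
    // Skips empty fields, which suits space-separated data with repeated
    // spaces but would misalign CSV rows that contain blank cells.
    std::vector<std::string> split(const std::string& line) const {
        std::vector<std::string> out;
        std::istringstream ss(line);
        std::string field;
        while (std::getline(ss, field, delim_))
            if (!field.empty()) out.push_back(field);
        return out;
    }

    std::istream& in_;
    char delim_;
    std::map<std::string, std::size_t> index_;  // column name -> position
    std::vector<std::string> fields_;           // fields of the current line
};

// A reader function that knows which columns it wants and their types.
struct TokenRank {
    std::vector<std::string> token;
    std::vector<int> rank;
};

TokenRank readTokenRank(std::istream& in) {
    DelimitedReader reader(in);
    TokenRank out;
    while (reader.next()) {
        out.token.push_back(reader.getString("token"));
        out.rank.push_back(reader.getInt("rank"));
    }
    assert(out.token.size() == out.rank.size());
    return out;
}
```

The file is parsed once, and each combination of columns needs only a short function like readTokenRank rather than a new parser. A variant that wants the eps column would call getDouble("eps") in the same loop.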