I have written a parser that reads character values into std::vector<std::string> vectors; it parses 1 million records in under a second. Then I want to convert the vectors to an Rcpp::DataFrame, which takes more than 80 seconds...
Is there a way to create an Rcpp::DataFrame from large character vectors "efficiently"?
When using numeric values, I would try to std::memcpy() the std::vector to an Rcpp::NumericVector (see this int64 example or this data.table example for more information), but this does not seem to work with character vectors due to their varying sizes.
The basic idea of the function is to parse Sudoku string data (each Sudoku string is exactly 81 characters long); there are two Sudokus in each line (the data is saved as a .csv file, you can find the data here).
$ head sudoku.csv
quizzes,solutions
004300209005009001070060043006002087190007400050083000600000105003508690042910300,864371259325849761971265843436192587198657432257483916689734125713528694542916378
040100050107003960520008000000000017000906800803050620090060543600080700250097100,346179258187523964529648371965832417472916835813754629798261543631485792254397186
600120384008459072000006005000264030070080006940003000310000050089700000502000190,695127384138459672724836915851264739273981546946573821317692458489715263562348197
497200000100400005000016098620300040300900000001072600002005870000600004530097061,497258316186439725253716498629381547375964182841572639962145873718623954534897261
005910308009403060027500100030000201000820007006007004000080000640150700890000420,465912378189473562327568149738645291954821637216397854573284916642159783891736425
100005007380900000600000480820001075040760020069002001005039004000020100000046352,194685237382974516657213489823491675541768923769352841215839764436527198978146352
009065430007000800600108020003090002501403960804000100030509007056080000070240090,289765431317924856645138729763891542521473968894652173432519687956387214178246395
000000657702400100350006000500020009210300500047109008008760090900502030030018206,894231657762495183351876942583624719219387564647159328128763495976542831435918276
503070190000006750047190600400038000950200300000010072000804001300001860086720005,563472198219386754847195623472638519951247386638519472795864231324951867186723945
Inside of the cpp read function, I fread() the file, fill a buffer (buffer), and parse the data into said std::vector<std::string> vectors (a and b in this example).
Note that the full code including the experiments I have done so far can be found in this gist.
const int BUFFERSIZE = 1e8;
const int n_lines = count_lines(filename); // 1 million in this case
FILE* infile;
infile = fopen(filename.c_str(), "r");
unsigned char * buffer;
buffer = (unsigned char*) malloc(BUFFERSIZE);
int64_t this_buffer_size;
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);
// removing of header not shown here...
// BUFFERSIZE is also checked so that no overflow occurs... not shown here..
int line = 0;
while ((this_buffer_size = fread(buffer, 1, BUFFERSIZE, infile)) > 0) {
int i = 1;
while (i < this_buffer_size) {
// buffer from i to i + 81 would look like this:
// 004300209005009001070060043006002087190007400050083000600000105003508690042910300
// whereas for b it looks from i to i + 81 like this:
// 864371259325849761971265843436192587198657432257483916689734125713528694542916378
a[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
b[line] = std::string(buffer + i, buffer + i + 81);
i += 81 + 1; // skip to the next value, +1 for the , or a newline
line++;
}
// check next buffer, not shown here...
}
// NEXT: parse the data to an R structure
For the 1 million row dataset this takes below 250 ms.
Then I want to create an Rcpp::DataFrame from the two vectors a and b; this is where the question comes in. The conversion to the R object takes around 80 seconds.
Is there any faster alternative, taking the data knowledge into account (2 items per row, each 81 chars long, 1 million rows, ...)?
I am not bound to filling the std::vectors first; if possible I can also directly collect the data in an Rcpp structure.
Rcpp::DataFrame df = Rcpp::DataFrame::create(
Rcpp::Named("unsolved") = a,
Rcpp::Named("solved") = b,
Rcpp::Named("stringsAsFactors") = false
);
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.frame");
// note: a valid data.frame also needs a row.names attribute, e.g.
// df.attr("row.names") = Rcpp::IntegerVector::create(NA_INTEGER, -n_lines);
Not truly comparable to the other approaches, but it felt more native...
// before the main loop
std::vector<std::string> vec;
// vec holds both data entries: the unsolved values at 0 -> n_lines
// and the solved values at n_lines -> 2 * n_lines
vec.resize(2 * n_lines);
int l = 0;
// inside the loop
vec[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
vec[l + n_lines] = std::string(buffer + i, buffer + i + 81);
i += 82;
l++;
// to CharacterMatrix
Rcpp::CharacterMatrix res(n_lines, 2, vec.begin());
Thanks for making a snapshot of the data available. (BTW: no point tar'ing a single file, you could just have xz'ed the csv file. Anyway.)
I get different results on my Ubuntu 20.04 box which are closer to what I anticipated:
- data.table::fread() is competitive, as we expected (I am running data.table from git as there was a regression in the most recent release)
- vroom and stringfish, once we force materialization to compare apples to apples rather than images of apples, are about the same
- Rcpp is in the ballpark too, but a little more variable

I capped it at 10 runs; the variability probably comes down if you run more, but caching influences it too.
In short: no clear winners, and surely no mandate to replace one of the (already known to be tuned) reference implementations.
edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
expr min lq mean median uq max neval cld
fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939 10 a
vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921 10 a
sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975 10 a
rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847 10 a
rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851 10 a
edd@rob:~/git/stackoverflow/65043010(master)$
suppressMessages({
library(data.table)
library(Rcpp)
library(vroom)
library(stringfish)
library(microbenchmark)
})
vroomread <- function(csvfile) {
a <- vroom(csvfile, col_types = "cc", progress = FALSE)
vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
a <- sf_readLines(csvfile)
dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163))
}
sourceCpp("rcppfuncs.cpp")
csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
vroom=vroomread(csvfile),
sfish=sfread(csvfile),
rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
times=10)
#include <Rcpp.h>
#include <fstream>
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
a.push_back(line.substr(0, 81));   // substr(pos, count): 81 characters
b.push_back(line.substr(82, 81));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
a(l) = line.substr(0, 81);   // substr(pos, count): 81 characters
b(l) = line.substr(82, 81);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
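If the remaining cost is building a std::string per field and then copying it again into R's string pool, one further variant worth trying is to create the CHARSXPs directly from the line buffer with Rf_mkCharLenCE, skipping the intermediate substr copies. A sketch only (it needs an R session to compile and run, and the fixed offsets 0 and 82 assume the exact 81+1+81 line shape above):

```cpp
#include <Rcpp.h>
#include <fstream>

//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_mkchar(std::string filename) {
  const int n_lines = 1000000;
  std::ifstream file(filename, std::ifstream::in);
  std::string line;
  std::getline(file, line);                  // burn the header
  Rcpp::CharacterVector a(n_lines), b(n_lines);
  int l = 0;
  while (std::getline(file, line) && l < n_lines) {
    // build each CHARSXP straight from the line buffer, no substr copies
    SET_STRING_ELT(a, l, Rf_mkCharLenCE(line.data(), 81, CE_UTF8));
    SET_STRING_ELT(b, l, Rf_mkCharLenCE(line.data() + 82, 81, CE_UTF8));
    l++;
  }
  Rcpp::List df(2);
  df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
  df["unsolved"] = a;
  df["solved"] = b;
  df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
  return df;
}
```

Note that R interns every distinct string in its global cache, so a hash lookup per field remains regardless; this only removes the extra std::string allocation per field.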