Rcpp Create DataFrame with Variable Number of Columns


I am interested in using Rcpp to create a data frame with a variable number of columns. By that, I mean that the number of columns will be known only at runtime. Some of the columns will be standard, but others will be repeated n times where n is the number of features I am considering in a particular run.

I am aware that I can create a data frame as follows:

IntegerVector i1(3); i1[0]=4;i1[1]=2134;i1[2]=3453;
IntegerVector i2(3); i2[0]=4123;i2[1]=343;i2[2]=99123;
DataFrame df = DataFrame::create(Named("V1")=i1,Named("V2")=i2);

but in this case it is assumed that the number of columns is 2.

To simplify the explanation of what I need, assume that I would like to pass a SEXP variable specifying the number of columns to create in the variable part. Something like:

RcppExport SEXP myFunc(SEXP n, SEXP <other stuff>)
IntegerVector i1(3); <compute i1>
IntegerVector i2(3); <compute i2>
for(int i=0;i<n;i++){compute vi}
DataFrame df = DataFrame::create(Named("Num")=i1,Named("ID")=i2,...,other columns v1 to vn);

where n is passed as an argument. The final data frame in R would look like

Num ID V1 ... Vn
  1  2  5     'aasda'
  ...

(In reality, the column names will not be of the form "Vx", but they will be known at runtime.) In other words, I cannot use a static list of

Named()=...

since the number will change.

I have tried skipping the "Named()" part of the constructor and then naming the columns at the end, but the results are junk.

Can this be done?


Solution

  • If I understand your question correctly, it seems like it would be easiest to take advantage of the DataFrame constructor that takes a List as an argument (since the size of a List can be specified directly), and set the names of your columns via .attr("names") and a CharacterVector:


    #include <Rcpp.h>
    
    // [[Rcpp::export]]
    Rcpp::DataFrame myFunc(int n, Rcpp::List lst, 
                           Rcpp::CharacterVector Names = Rcpp::CharacterVector::create()) {
    
      Rcpp::List tmp(n + 2);
      tmp[0] = Rcpp::IntegerVector(3);
      tmp[1] = Rcpp::IntegerVector(3);
    
      Rcpp::CharacterVector lnames = Names.size() < lst.size() ?
        lst.attr("names") : Names;
      Rcpp::CharacterVector names(n + 2);
      names[0] = "Num";
      names[1] = "ID";
    
      for (int i = 0; i < n; i++) {
        // tmp[i + 2] = do_something(lst[i]);
        tmp[i + 2] = lst[i];
        if (std::string(lnames[i]).compare("") != 0) {
          names[i + 2] = lnames[i];
        } else {
          names[i + 2] = "V" + std::to_string(i);
        }
      }
      Rcpp::DataFrame result(tmp);
      result.attr("names") = names;
      return result;
    }
    

    There's a little extra going on there to allow the Names vector to be optional - e.g. if you just use a named list you can omit the third argument.


    lst1 <- list(1L:3L, 1:3 + .25, letters[1:3])
    ##
    > myFunc(length(lst1), lst1, c("V1", "V2", "V3"))
    #  Num ID V1   V2 V3
    #1   0  0  1 1.25  a
    #2   0  0  2 2.25  b
    #3   0  0  3 3.25  c
    
    lst2 <- list(
      Column1 = 1L:3L,
      Column2 = 1:3 + .25,
      Column3 = letters[1:3],
      Column4 = LETTERS[1:3])
    ##
    > myFunc(length(lst2), lst2)
    #  Num ID Column1 Column2 Column3 Column4
    #1   0  0       1    1.25       a       A
    #2   0  0       2    2.25       b       B
    #3   0  0       3    3.25       c       C
    

    Just be aware of the 20-length limit for this signature of the DataFrame constructor, as pointed out by @hrbrmstr.
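
For readers who want to see the column-naming fallback on its own, here is a minimal sketch in plain standard C++, with no Rcpp types, so it can be compiled and checked independently. The helper name `make_column_names` is hypothetical and not part of Rcpp. It mirrors the loop in `myFunc`: two fixed names, then either the supplied name or a generated `"V" + i`. Unlike the code above, it also falls back to the generated name when fewer than `n` names are supplied, which is an added assumption rather than the original behaviour.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an Rcpp API): builds the final column names for
// two fixed columns ("Num", "ID") followed by n runtime-determined columns.
std::vector<std::string> make_column_names(
    std::size_t n, const std::vector<std::string>& supplied) {
  std::vector<std::string> names;
  names.reserve(n + 2);
  names.push_back("Num");
  names.push_back("ID");
  for (std::size_t i = 0; i < n; ++i) {
    // Use the supplied name when it exists and is non-empty;
    // otherwise generate "V<i>" (0-based, as in myFunc above).
    if (i < supplied.size() && !supplied[i].empty()) {
      names.push_back(supplied[i]);
    } else {
      names.push_back("V" + std::to_string(i));
    }
  }
  return names;
}
```

In an Rcpp function, the resulting vector could then be assigned to the data frame's `names` attribute, as `myFunc` does with its `CharacterVector`.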