
Rcpp: import a list / data frame from R with a large number of variables


I am new to Rcpp programming and I cannot figure out a very basic thing.

I am trying to import a large list from R into C++ using Rcpp. The list I have contains about 400,000 rows and 50 columns. I am recreating a smaller version of it here for your reference.

df1 = data.frame(Variable1=c(1,2,3,4,5,6,7,8,9,10,1),Variable2=c(11,12,13,14,15,16,17,18,19,20,11),
             Variable3 = c(1,0,0,1,1,0,0,0,1,0,1),
             Variable4=c(1,1,1,1,2,2,2,2,2,2,2),
             Variable5=c(20,-2,-5,10,30,2,1,.5,50,-1,60))

This is a data frame object. From this post (how many vectors can be added in DataFrame::create( vec1, vec2 ... )?) I understood that Rcpp can only import data frame objects with up to 20 columns, although Kevin Ushey's answer there (how many vectors can be added in DataFrame::create( vec1, vec2 ... )?) suggests you can have as many columns as you want. I would prefer not to go the data frame route, since I need to write a fairly complicated function.

My confusion comes from the following: when I use

typeof(df1)

R tells me that this is a list object.

What would be the best way to import this data into Rcpp? Could someone point me to a source or show me example code that I could adapt for my dataset (please note that my dataset has 50 columns)?

Any help/advice will be greatly appreciated.


Solution

  • As @RalfStubner and @duckmayr mentioned, you may have misread the existing restriction: the 20-argument limit applies to constructing a new data frame with DataFrame::create(). There is no restriction on accepting an existing data frame object, however many columns it has.

    To illustrate, here is a not entirely sensible example of a 500 column data.frame (which, for simplicity, we assume to contain only numeric vectors) where we sum up all elements in the first row.

    Code

    #include <Rcpp.h>
    
    // [[Rcpp::export]]
    double extractFromBigDataFrame(Rcpp::DataFrame d, bool verbose=false) {
      int n = d.length();  // number of columns
      double sum = 0;
      for (int i=0; i<n; i++) {
        // simplifying assumption: each column is numeric
        // (integer columns are coerced to double here)
        Rcpp::NumericVector x = d[i];
        double elem = x[0];  // first-row value of column i
        sum += elem;
        if (verbose) Rcpp::print(x);
      }
      return sum;
    }
    
    /*** R
    m <- matrix(1:1000, 2, 500)
    d <- as.data.frame(m)
    extractFromBigDataFrame(d)
    rowSums(m)  # comparison
    */
    

    Output

    R> Rcpp::sourceCpp("/tmp/so54563983.cpp")
    
    R> m <- matrix(1:1000, 2, 500)
    
    R> d <- as.data.frame(m)
    
    R> extractFromBigDataFrame(d)
    [1] 250000
    
    R> rowSums(m)  # comparison
    [1] 250000 250500
    R>
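
For a function more involved than this sum, one possible arrangement (a sketch, not part of the original answer) is to keep the computation in plain C++ and let a thin Rcpp wrapper copy the columns out of the data frame. The names `sumFirstRow` and `sumFirstRowDf` are made up for illustration, every column is again assumed to be numeric, and the `__has_include` guard is only there so that the plain C++ part also compiles where Rcpp is not installed.

```cpp
#include <vector>

// Plain C++ core (hypothetical name): sums the first element of every
// non-empty column. It knows nothing about R or Rcpp.
double sumFirstRow(const std::vector<std::vector<double>>& columns) {
  double sum = 0.0;
  for (const std::vector<double>& col : columns) {
    if (!col.empty()) sum += col[0];
  }
  return sum;
}

// Rcpp glue, compiled only when the Rcpp headers can be found.
#if defined(__has_include)
#if __has_include(<Rcpp.h>)
#include <Rcpp.h>

// [[Rcpp::export]]
double sumFirstRowDf(Rcpp::DataFrame d) {
  std::vector<std::vector<double>> columns;
  columns.reserve(d.size());
  for (R_xlen_t i = 0; i < d.size(); ++i) {
    // Assumption: column i is numeric, so it can be copied
    // into a std::vector<double>.
    SEXP col = d[i];
    columns.push_back(Rcpp::as<std::vector<double>>(col));
  }
  return sumFirstRow(columns);
}
#endif
#endif
```

With Rcpp available, `sumFirstRowDf(d)` on the data frame above should agree with `extractFromBigDataFrame(d)`. The copy into `std::vector` costs extra memory, so for 400,000 rows and 50 columns it may be preferable to work on `Rcpp::NumericVector` directly, as in the answer.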