Avoid SIGSEGV when subsetting a data.frame with a call to `[.data.frame` in Rcpp


My Rcpp code occasionally fails (segfault, etc.) for reasons I don't understand. The code creates a large data.frame and then tries to take a subset of it by calling the R subsetting function, `[.data.frame`, from within the same function that creates the frame. A very simplified version is shown below:

library(Rcpp)
src <- 'DataFrame test() {
// R function to subset data.frame - what will be called to subset
Function subsetinR("[.data.frame");

// Make a dataframe in Rcpp to subset
size_t n = 100;
auto df =  DataFrame::create(Named("a") = std::vector<double> (n, 2.0),
                             Named("b") = std::vector<double> (n, 4.0));

// Now make a vector to subset with 
LogicalVector filter = LogicalVector::create(n, TRUE);
for (size_t i =0; i < n; i++) {
    if (i % 2 == 0) filter[i] = FALSE;
}   

// Subset, here is where it fails!
df = subsetinR(df, filter, R_MissingArg);
return df; 
}'  

fun <- cppFunction(plugins=c("cpp11"), src, verbose = TRUE, depends="Rcpp") 
fun()

However, while this occasionally works, at other times it fails with the following error:

*** caught segfault ***
   address 0x7ff700000030, cause 'memory not mapped'

Anyone know what is going wrong?

Note: This is not a duplicate. I have seen other Stack Overflow answers that build the subset by indexing each column vector individually, e.g.

  // Next up, create a new DataFrame Object with selected rows subset. 
  return Rcpp::DataFrame::create(Rcpp::Named("val1")  = val1[idx],
                                 Rcpp::Named("val2")  = val2[idx],
                                 Rcpp::Named("val3")  = val3[idx],
                                 Rcpp::Named("val4")  = val4[idx]
                                 );

However, I am explicitly looking to avoid the repeated [idx] subsetting, because idx is not known when the data.frame is constructed (it is only known at the end), and I am hoping to find a way that doesn't involve indexing every column separately. If it's possible to transform the data.frame at the end in one go, that would work just fine.


Solution

  • The problem is that LogicalVector::create() does not do what you expect: it builds a vector with one element per argument. In other words, your code:

    LogicalVector filter = LogicalVector::create(n, TRUE);

    generates not a logical vector of length n filled with TRUE, but a logical vector of length two: the first element is n coerced to logical ('truthy', so TRUE) and the second is explicitly TRUE.

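    You likely intended to use the regular constructor, e.g. LogicalVector(n, TRUE). With a filter of only two elements, the loop's writes to filter[i] for i >= 2 run past the end of the vector; that is undefined behavior, which would explain why the crash shows up only some of the time.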