Search code examples
rrcpp

Why is my Rcpp function crashing when using a large input matrix?


I made a simple Rcpp fucntion to calculate all pearson correlation coefficients that can be computed from all row combinations of an input matrix E. The results are stored with 4 decimals of precision (in intger format) in a vector v. The function works fine if the dimensions of E aren't too large but just crashes when I test with a data size similar to that of the real data that I want to process with the function.

Here is the Rccp code:

#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void pearson(NumericMatrix E, IntegerVector v){
    int rows = E.nrow();
    int cols = E.ncol();
    int j, irow, jrow;
    double rowsum;
    NumericVector means(rows);
    int k = 0;
    double cov, varx, vary;
    double pearson;

    for(irow = 0; irow < rows; irow++){
        rowsum = 0;
        for(j = 0; j < cols; j++){
            rowsum += E(irow, j);
        }
        means[irow] = rowsum / cols;
    }
    
    for(irow = 0; irow < rows - 1; irow++){
        for(jrow = irow + 1; jrow < rows; jrow++){
            cov = 0;
            varx = 0;
            vary = 0;
            for(j = 0; j < cols; j++) {
                cov += (E(irow, j) - means[irow]) * (E(jrow, j) - means[jrow]);
                varx += std::pow(E(irow, j) - means[irow], 2);
                vary += std::pow(E(jrow, j) - means[jrow], 2);
            }
            pearson = cov / std::sqrt(varx * vary);
            v[k] = (int) (pearson * 10000);
            k++;
        }
    }

}

And then for testing it in R I started with the following:

library(Rcpp)
sourceCpp("pearson.cpp")
testin <- matrix(rnorm(1000 * 1100), nrow = 1000, ncol = 1100)
testout <- integer( (nrow(testin) * (nrow(testin) - 1)) / 2 )
pearson(testin, testout) # success!

However when increasing input size the R session crashes after executing the last line in the following script:

library(Rcpp)
sourceCpp("pearson.cpp")
testin <- matrix(rnorm(16000 * 17000), nrow = 16000, ncol = 17000)
testout <- integer( (nrow(testin) * (nrow(testin) - 1)) / 2 )
pearson(testin, testout) # sad

I feel like this is strange since I'm able to allocate the input and the output just fine before executing the function. Inside the function the output vector is modified by reference. Can't figure out what is wrong. Currently I'm working on machine with 16GB RAM.

EDIT: output of sessionInfo()

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Mexico.1252 
[2] LC_CTYPE=Spanish_Mexico.1252   
[3] LC_MONETARY=Spanish_Mexico.1252
[4] LC_NUMERIC=C                   
[5] LC_TIME=Spanish_Mexico.1252    

attached base packages:
[1] stats     graphics  grDevices
[4] utils     datasets  methods  
[7] base     

other attached packages:
[1] Rcpp_1.0.5

loaded via a namespace (and not attached):
[1] compiler_4.0.4

Solution

  • Just for the sake of giving closure to this question, I tried to run the function just allocating the inputs and not running the actual algorithm as suggested in the comments and it returns just fine. I think in Windows for some reason when the input reaches a certain size the window will dim and say "not responding" next to the R console's window name. However the function is still running as it will eventually finish if left enough time and the R console's window will return to normal. The fact that the process took so long and that the window looked like when Rcpp crashes led me to think the process was not running and that it was some sort of crash.

    What I ended up doing is programming a parallel version of the algorithm with the aid of this very helpful tutorial by some of the creators of RcppParallel. Since I cannot afford using the base R cor() function due to memory constraints, making the parallel version suited my needs perfectly.