
Removing columns from big.matrix which have only one value


I have a very large binary matrix (100 rows by 5 million columns), stored as a big.matrix to conserve memory; as an ordinary matrix it would take over 2 GB. A scaled-down example:

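library(bigmemory)

r <- 100    # rows
c <- 10000  # columns (scaled down from the real 5 million)
m4 <- matrix(sample(0:1, r * c, replace = TRUE), r, c)
m4 <- cbind(m4, 1)       # append a constant column that should be removed
m4 <- as.big.matrix(m4)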

I need to remove every column which has only one unique value (in this case, only 0s or only 1s). Because of the number of columns, I want to be able to do this in parallel.

How can I accomplish this while keeping the data stored as a big.matrix? I could convert it to a data frame and loop over the columns, counting the unique values in each, but that takes too much RAM.

Thanks!


Solution

  • Put this in a .cpp file and source it with Rcpp::sourceCpp():

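    // [[Rcpp::depends(BH, bigmemory)]]
    #include <bigmemory/MatrixAccessor.hpp>
    #include <Rcpp.h>
    using namespace Rcpp;
    
    // Returns TRUE for every column that holds more than one distinct value.
    // Assumes the big.matrix stores doubles (true for the example below).
    // [[Rcpp::export]]
    LogicalVector to_keep(SEXP bm_addr) {
    
      XPtr<BigMatrix> xptr(bm_addr);
      MatrixAccessor<double> macc(*xptr);  // macc[j][i] is row i of column j
    
      size_t n = macc.nrow();
      size_t m = macc.ncol();
    
      LogicalVector keep(m, false);
      if (n == 0) return keep;  // no rows, so there is nothing to compare
    
      for (size_t j = 0; j < m; j++) {
        double first_val = macc[j][0];
        for (size_t i = 1; i < n; i++) {
          if (macc[j][i] != first_val) {
            keep[j] = true;  // second distinct value found: keep the column
            break;
          }
        }
      }
    
      return keep;
    }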
    
    /*** R
    library(bigmemory)
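    r <- 100
    c <- 10000
    m4 <- matrix(sample(0:1, r * c, replace = TRUE), r, c)
    m4 <- cbind(m4, 1)       # constant column of 1s; also promotes m4 to double
    m4 <- as.big.matrix(m4)
    m4[, 1] <- 1             # force two more constant columns for the test
    m4[, 2] <- 0
    
    keep <- to_keep(m4@address)                  # TRUE for columns to keep
    m4.keep <- deepcopy(m4, cols = which(keep))  # new big.matrix with only those columns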
    */
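
The question also asks about running this in parallel, which the function above does not do. Since each column is checked independently, one possible approach is to put an OpenMP `parallel for` on the outer loop. The following is only a sketch under stated assumptions: it works on a plain column-major `std::vector<double>` rather than on a big.matrix, the name `to_keep_parallel` is hypothetical, and it has not been benchmarked (the scan may well be memory-bound, so the speed-up is not guaranteed). Without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of bigmemory or Rcpp.
// `data` is an nrow x ncol matrix in column-major order (the layout R uses).
// Returns 1 for each column holding more than one distinct value, else 0.
std::vector<char> to_keep_parallel(const std::vector<double>& data,
                                   std::size_t nrow, std::size_t ncol) {
  assert(data.size() == nrow * ncol);

  // std::vector<char>, not std::vector<bool>: distinct char elements can be
  // written from different threads without a data race.
  std::vector<char> keep(ncol, 0);

  // A signed loop index keeps the loop valid for older OpenMP versions.
  const std::ptrdiff_t m = static_cast<std::ptrdiff_t>(ncol);

  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t j = 0; j < m; ++j) {
    const double* col = data.data() + static_cast<std::size_t>(j) * nrow;
    for (std::size_t i = 1; i < nrow; ++i) {
      if (col[i] != col[0]) {  // second distinct value: keep this column
        keep[j] = 1;
        break;                 // leaves only the inner (serial) loop
      }
    }
  }
  return keep;
}
```

To try the same idea inside the Rcpp function, the usual route is to add `// [[Rcpp::plugins(openmp)]]` at the top of the .cpp file and place the pragma on the loop over `j`. It is probably safest to write the results into a plain buffer like the one here and copy them into the `LogicalVector` after the parallel region.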