Search code examples
rbigdatar-bigmemory

Computing row sums of a big.matrix in R?


I have a big matrix with about 60 million rows and 150 colums (roughly 9 billion elements total). I have stored this data in a big.matrix object (from package bigmemory). Now, I wish to compute the sum of each row, which is a problem because big.matrix is column-oriented, so as far as I can tell all the summary functions are column oriented (e.g. colsum, colmax, etc.) and there is no function available by default for computing row sums. Of course I can do apply(x, 1, sum), but this will take a very long time. I can also loop over the columns one by one and use vectorized addition to add them:

mysum <- rep(0, nrow(x))
for (i in seq(ncol(x))) 
  mysum <- mysum + x[,i]

but this still takes over 20 minutes, and is obviously suboptimal since it is creating a new 60-million-element vector each time through the loop. It seems like there must be some faster way to do this.

Edit

I got this down to 10 minutes by processing chunks of a million or so rows at a time, and calling rowSums on those, and then concatenating the results. I'd still be interested to know if there is an optimized way to do this, though.


Solution

  • I've written some C++ code to do this, adapted from the bigmemory Rcpp gallery:

    rowSums.cpp

    // [[Rcpp::depends(BH)]]
    #include <Rcpp.h>
    using namespace Rcpp;
    
    // [[Rcpp::depends(BH, bigmemory)]]
    #include <bigmemory/MatrixAccessor.hpp>
    
    #include <numeric>
    
    // Logic for BigRowSums.
    template <typename T>
    NumericVector BigRowSums(XPtr<BigMatrix> pMat, MatrixAccessor<T> mat) {
        NumericVector rowSums(pMat->nrow(), 0.0);
        NumericVector value(1);
        for (int jj = 0; jj < pMat->ncol(); jj++) {
          for (int ii = 0; ii < pMat->nrow(); ii++) {
            value = mat[jj][ii];
            if (all(!is_na(value))) {
              rowSums[ii] += value[0];
            }   
          }   
        }   
        return rowSums;
    }
    
    // Dispatch function for BigRowSums
    //
    // [[Rcpp::export]]
    NumericVector BigRowSums(SEXP pBigMat) {
        XPtr<BigMatrix> xpMat(pBigMat);
    
        switch(xpMat->matrix_type()) {
          case 1:
            return BigRowSums(xpMat, MatrixAccessor<char>(*xpMat));
          case 2:
            return BigRowSums(xpMat, MatrixAccessor<short>(*xpMat));
          case 4:
            return BigRowSums(xpMat, MatrixAccessor<int>(*xpMat));
          case 6:
            return BigRowSums(xpMat, MatrixAccessor<float>(*xpMat));
          case 8:
            return BigRowSums(xpMat, MatrixAccessor<double>(*xpMat));
          default:
            throw Rcpp::exception("unknown type detected for big.matrix object!");
        }   
    }
    

    In R:

    library(bigmemory)
    library(Rcpp)
    sourceCpp("rowSums.cpp")
    
    m <- as.big.matrix(matrix(1:9, 3))
    BigRowSums(m@address)
    [1] 12 15 18