I have a big matrix with about 60 million rows and 150 colums (roughly 9 billion elements total). I have stored this data in a big.matrix
object (from package bigmemory
). Now, I wish to compute the sum of each row, which is a problem because big.matrix
is column-oriented, so as far as I can tell all the summary functions are column oriented (e.g. colsum
, colmax
, etc.) and there is no function available by default for computing row sums. Of course I can do apply(x, 1, sum)
, but this will take a very long time. I can also loop over the columns one by one and use vectorized addition to add them:
mysum <- rep(0, nrow(x))
for (i in seq(ncol(x)))
mysum <- mysum + x[,i]
but this still takes over 20 minutes, and is obviously suboptimal since it is creating a new 60-million-element vector each time through the loop. It seems like there must be some faster way to do this.
I got this down to 10 minutes by processing chunks of a million or so rows at a time, and calling rowSums on those, and then concatenating the results. I'd still be interested to know if there is an optimized way to do this, though.
I've written some C++ code to do this, adapted from the bigmemory Rcpp gallery:
rowSums.cpp
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::depends(BH, bigmemory)]]
#include <bigmemory/MatrixAccessor.hpp>
#include <numeric>
// Logic for BigRowSums.
template <typename T>
NumericVector BigRowSums(XPtr<BigMatrix> pMat, MatrixAccessor<T> mat) {
NumericVector rowSums(pMat->nrow(), 0.0);
NumericVector value(1);
for (int jj = 0; jj < pMat->ncol(); jj++) {
for (int ii = 0; ii < pMat->nrow(); ii++) {
value = mat[jj][ii];
if (all(!is_na(value))) {
rowSums[ii] += value[0];
}
}
}
return rowSums;
}
// Dispatch function for BigRowSums
//
// [[Rcpp::export]]
NumericVector BigRowSums(SEXP pBigMat) {
XPtr<BigMatrix> xpMat(pBigMat);
switch(xpMat->matrix_type()) {
case 1:
return BigRowSums(xpMat, MatrixAccessor<char>(*xpMat));
case 2:
return BigRowSums(xpMat, MatrixAccessor<short>(*xpMat));
case 4:
return BigRowSums(xpMat, MatrixAccessor<int>(*xpMat));
case 6:
return BigRowSums(xpMat, MatrixAccessor<float>(*xpMat));
case 8:
return BigRowSums(xpMat, MatrixAccessor<double>(*xpMat));
default:
throw Rcpp::exception("unknown type detected for big.matrix object!");
}
}
In R:
library(bigmemory)
library(Rcpp)
sourceCpp("rowSums.cpp")
m <- as.big.matrix(matrix(1:9, 3))
BigRowSums(m@address)
[1] 12 15 18