SparseQR for Least Squares

For an application I am building I need to run linear regression on large datasets in order to obtain residuals. For example, one dataset is more than 1 million x 20k in dimension. For the smaller datasets I was using fastLm from the RcppArmadillo package - which works great for those - currently. With time those datasets will also grow beyond 1 million rows.

My solution was to use sparse matrices and Eigen. I was unable to find a good example for using SparseQR in RcppEigen. Based on many hours of reading (e.g. rcpp-gallery, stackoverflow, rcpp-dev mailinglist, eigen docs, rcpp-gallery, stackoverflow and many more that I have forgotten but sure have read) I wrote the following piece of code;

(NB: my first c++ program - please be nice :) - any advice to improve is welcomed)

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
using namespace Rcpp;
using namespace Eigen;

using Eigen::Map;
using Eigen::SparseMatrix;
using Eigen::MappedSparseMatrix;
using Eigen::VectorXd;
using Eigen::SimplicialCholesky;


// [[Rcpp::export]]
List sparseLm_eigen(const SEXP Xr, 
                    const NumericVector yr){
  
  typedef SparseMatrix<double>        sp_mat;
  typedef MappedSparseMatrix<double>  sp_matM;
  typedef Map<VectorXd>               vecM;
  typedef SimplicialCholesky<sp_mat>  solver;

  const sp_mat Xt(Rcpp::as<sp_matM>(Xr).adjoint());
  const VectorXd Xty(Xt * Rcpp::as<vecM>(yr));
  const solver Ch(Xt * Xt.adjoint());
  
  if(Ch.info() != Eigen::Success) return "failed";
  
  return List::create(Named("betahat") = Ch.solve(Xty));
}

This works for example for;

library(Matrix)
library(speedglm)
Rcpp::sourceCpp("sparseLm_eigen.cpp")

data("data1")
data1$fat1 <- factor(data1$fat1)
mm <- model.matrix(formula("y ~ fat1 + x1 + x2"), dat = data1)

sp_mm <- as(mm, "dgCMatrix")
y <- data1$y

res1 <- sparseLm_eigen(sp_mm, y)$betahat
res2 <- unname(coefficients(lm.fit(mm, y)))

abs(res1 - res2)

It fails however for my large datasets (as I kind of expected). My initial intention was to use the SparseQR as a solver but I don't know how to implement that.

So my question - can someone help me to implement QR decomposition for sparse matrices with RcppEigen?

Solution I Used

Disclaimer: I don't know if it's correct - it works for my particular problem and seems to deliver correct results.

As per @TomWenseleers comment to add my solution, here it is. I could not get Eigen's LeastSquareConjugateGradient to work. I think because as the documentation states A'A must be positive definitive. So instead of solving A = bx I am solving A'A = A'bx using Eigen's Conjugate Gradient with a diagonal preconditioner. Then using the coefficients to calculate the fitted values and the residuals.

#include <RcppEigen.h>
#include <Eigen/IterativeLinearSolvers>
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
Rcpp::List sparse_cg(const Eigen::Map<Eigen::SparseMatrix<double> > &A,
                     const Eigen::Map<Eigen::VectorXd> &b,
                     const Eigen::Map<Eigen::VectorXd> &x0,
                     const int &maxiter,
                     const double &tol) {
    
  // Set-up the Conjugate Gradient solver
  Eigen::ConjugateGradient<Eigen::SparseMatrix<double>,
                           Eigen::Lower|Eigen::Upper,
                           Eigen::DiagonalPreconditioner<double> > cg;
  // Initialize solver
  cg.setMaxIterations(maxiter);
  cg.setTolerance(tol);
  cg.compute(A);

  // Solve system with guess (i.e. old solutions)
  Eigen::VectorXd coef = cg.solveWithGuess(b, x0);

  // Solver convergence stats
  int iter = cg.iterations();
  double err = cg.error();

  return Rcpp::List::create(Rcpp::Named("coefficients") = coef,
                            Rcpp::Named("itr") = iter,
                            Rcpp::Named("error") = err);
}

Solution

How to write a sparse solver with Eigen is a bit generic. This is mainly because the sparse solver classes are designed superbly well. They provide a guide explaining their sparse solver classes. Since the question focuses on SparseQR, the documentation indicates that there are two parameters required to initialize the solver: SparseMatrix class type and OrderingMethods class that dictates the supported fill-reducing ordering method.

With this in mind, we can whip up the following:

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <Eigen/SparseQR>

// [[Rcpp::export]]
Rcpp::List sparseLm_eigen(const Eigen::MappedSparseMatrix<double> A, 
                          const Eigen::Map<Eigen::VectorXd> b){

  Eigen::SparseQR <Eigen::MappedSparseMatrix<double>, Eigen::COLAMDOrdering<int> > solver;
  solver.compute(A);
  if(solver.info() != Eigen::Success) {
    // decomposition failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }
  Eigen::VectorXd x = solver.solve(b);
  if(solver.info() != Eigen::Success) {
    // solving failed
    return Rcpp::List::create(Rcpp::Named("status") = false);
  }

  return Rcpp::List::create(Rcpp::Named("status") = true,
                            Rcpp::Named("betahat") = x);
}

Note: Here we create a list that always passes a named status variable that should be checked first. This indicates whether convergence happens in two areas: decomposition and solving. If all checks out, then we pass the betahat coefficient.

Test Script:

library(Matrix)
library(speedglm)
Rcpp::sourceCpp("sparseLm_eigen.cpp")

data("data1")
data1$fat1 <- factor(data1$fat1)
mm <- model.matrix(formula("y ~ fat1 + x1 + x2"), dat = data1)

sp_mm <- as(mm, "dgCMatrix")
y <- data1$y

res1 <- sparseLm_eigen(sp_mm, y)
if(res1$status != TRUE){
    stop("convergence issue")
}
res1_coef = res1$betahat
res2_coef <- unname(coefficients(lm.fit(mm, y)))

cbind(res1_coef, res2_coef)

Output:

        res1_coef    res2_coef
[1,]  1.027742926  1.027742926
[2,]  0.142334262  0.142334262
[3,]  0.044327457  0.044327457
[4,]  0.338274783  0.338274783
[5,] -0.001740012 -0.001740012
[6,]  0.046558506  0.046558506