For an application I am building I need to run linear regression on large datasets in order to obtain residuals. For example, one dataset is more than 1 million x 20k in dimension. For the smaller datasets I was using fastLm
from the RcppArmadillo
package - which works great for those - currently. With time those datasets will also grow beyond 1 million rows.
My solution was to use sparse matrices and Eigen. I was unable to find a good example for using SparseQR in RcppEigen. Based on many hours of reading (e.g. rcpp-gallery, stackoverflow, rcpp-dev mailinglist, eigen docs, rcpp-gallery, stackoverflow and many more that I have forgotten but sure have read) I wrote the following piece of code;
(NB: my first c++ program - please be nice :) - any advice to improve is welcomed)
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
using namespace Rcpp;
using namespace Eigen;
using Eigen::Map;
using Eigen::SparseMatrix;
using Eigen::MappedSparseMatrix;
using Eigen::VectorXd;
using Eigen::SimplicialCholesky;
// [[Rcpp::export]]
List sparseLm_eigen(const SEXP Xr,
const NumericVector yr){
typedef SparseMatrix<double> sp_mat;
typedef MappedSparseMatrix<double> sp_matM;
typedef Map<VectorXd> vecM;
typedef SimplicialCholesky<sp_mat> solver;
const sp_mat Xt(Rcpp::as<sp_matM>(Xr).adjoint());
const VectorXd Xty(Xt * Rcpp::as<vecM>(yr));
const solver Ch(Xt * Xt.adjoint());
if(Ch.info() != Eigen::Success) return "failed";
return List::create(Named("betahat") = Ch.solve(Xty));
}
This works for example for;
library(Matrix)
library(speedglm)
Rcpp::sourceCpp("sparseLm_eigen.cpp")
data("data1")
data1$fat1 <- factor(data1$fat1)
mm <- model.matrix(formula("y ~ fat1 + x1 + x2"), dat = data1)
sp_mm <- as(mm, "dgCMatrix")
y <- data1$y
res1 <- sparseLm_eigen(sp_mm, y)$betahat
res2 <- unname(coefficients(lm.fit(mm, y)))
abs(res1 - res2)
It fails however for my large datasets (as I kind of expected). My initial intention was to use the SparseQR
as a solver but I don't know how to implement that.
So my question - can someone help me to implement QR decomposition for sparse matrices with RcppEigen?
Solution I Used
Disclaimer: I don't know if it's correct - it works for my particular problem and seems to deliver correct results.
As per @TomWenseleers comment to add my solution, here it is. I could not get Eigen's LeastSquareConjugateGradient to work. I think because as the documentation states A'A
must be positive definitive. So instead of solving A = bx
I am solving A'A = A'bx
using Eigen's Conjugate Gradient with a diagonal preconditioner. Then using the coefficients to calculate the fitted values and the residuals.
#include <RcppEigen.h>
#include <Eigen/IterativeLinearSolvers>
// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
Rcpp::List sparse_cg(const Eigen::Map<Eigen::SparseMatrix<double> > &A,
const Eigen::Map<Eigen::VectorXd> &b,
const Eigen::Map<Eigen::VectorXd> &x0,
const int &maxiter,
const double &tol) {
// Set-up the Conjugate Gradient solver
Eigen::ConjugateGradient<Eigen::SparseMatrix<double>,
Eigen::Lower|Eigen::Upper,
Eigen::DiagonalPreconditioner<double> > cg;
// Initialize solver
cg.setMaxIterations(maxiter);
cg.setTolerance(tol);
cg.compute(A);
// Solve system with guess (i.e. old solutions)
Eigen::VectorXd coef = cg.solveWithGuess(b, x0);
// Solver convergence stats
int iter = cg.iterations();
double err = cg.error();
return Rcpp::List::create(Rcpp::Named("coefficients") = coef,
Rcpp::Named("itr") = iter,
Rcpp::Named("error") = err);
}
How to write a sparse solver with Eigen is a bit generic. This is mainly because the sparse solver classes are designed superbly well. They provide a guide explaining their sparse solver classes. Since the question focuses on SparseQR, the documentation indicates that there are two parameters required to initialize the solver: SparseMatrix class type and OrderingMethods class that dictates the supported fill-reducing ordering method.
With this in mind, we can whip up the following:
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <Eigen/SparseQR>
// [[Rcpp::export]]
Rcpp::List sparseLm_eigen(const Eigen::MappedSparseMatrix<double> A,
const Eigen::Map<Eigen::VectorXd> b){
Eigen::SparseQR <Eigen::MappedSparseMatrix<double>, Eigen::COLAMDOrdering<int> > solver;
solver.compute(A);
if(solver.info() != Eigen::Success) {
// decomposition failed
return Rcpp::List::create(Rcpp::Named("status") = false);
}
Eigen::VectorXd x = solver.solve(b);
if(solver.info() != Eigen::Success) {
// solving failed
return Rcpp::List::create(Rcpp::Named("status") = false);
}
return Rcpp::List::create(Rcpp::Named("status") = true,
Rcpp::Named("betahat") = x);
}
Note: Here we create a list that always passes a named status
variable that should be checked first. This indicates whether convergence happens in two areas: decomposition and solving. If all checks out, then we pass the betahat
coefficient.
Test Script:
library(Matrix)
library(speedglm)
Rcpp::sourceCpp("sparseLm_eigen.cpp")
data("data1")
data1$fat1 <- factor(data1$fat1)
mm <- model.matrix(formula("y ~ fat1 + x1 + x2"), dat = data1)
sp_mm <- as(mm, "dgCMatrix")
y <- data1$y
res1 <- sparseLm_eigen(sp_mm, y)
if(res1$status != TRUE){
stop("convergence issue")
}
res1_coef = res1$betahat
res2_coef <- unname(coefficients(lm.fit(mm, y)))
cbind(res1_coef, res2_coef)
Output:
res1_coef res2_coef
[1,] 1.027742926 1.027742926
[2,] 0.142334262 0.142334262
[3,] 0.044327457 0.044327457
[4,] 0.338274783 0.338274783
[5,] -0.001740012 -0.001740012
[6,] 0.046558506 0.046558506