Tags: r, constraints, rcpp, least-squares

Rcpp implementation of fast nonnegative least squares?


I was on the lookout for a fast R implementation of the fast (active set based) nonnegative least squares algorithm of Bro, R., & de Jong, S. (1997), "A fast non-negativity-constrained least squares algorithm", Journal of Chemometrics, 11, 393-401. In the multiway package I found this pure R implementation:

# inputs: XtX = crossprod(X) and Xty = crossprod(X, y); smpower() is an
# internal multiway helper -- smpower(A, -1) inverts the symmetric matrix A
# via its eigendecomposition, so it is equivalent to solve(A) here
fnnls <- 
  function(XtX,Xty,ntol=NULL){
    ### initialize variables
    pts <- length(Xty)
    if(is.null(ntol)){
      ntol <- 10*(.Machine$double.eps)*max(colSums(abs(XtX)))*pts
    }
    pvec <- matrix(0,1,pts)
    Zvec <- matrix(1:pts,pts,1)
    beta <- zvec <- t(pvec)
    zz <- Zvec
    wvec <- Xty - XtX%*%beta
    
    ### iterative procedure
    iter <- 0    
    itmax <- 30*pts
    
    # outer loop
    while(any(Zvec>0) && any(wvec[zz]>ntol)) {
      
      tt <- zz[which.max(wvec[zz])]
      pvec[1,tt] <- tt
      Zvec[tt] <- 0
      pp <- which(pvec>0)
      zz <- which(Zvec>0)
      nzz <- length(zz)
      zvec[pp] <- smpower(XtX[pp,pp],-1)%*%Xty[pp]
      zvec[zz] <- matrix(0,nzz,1)
      
      # inner loop
      while(any(zvec[pp]<=ntol) &&  iter<itmax) {
        
        iter <- iter + 1
        qq <- which((zvec<=ntol) & t(pvec>0))
        alpha <- min(beta[qq]/(beta[qq]-zvec[qq]))
        beta <- beta + alpha*(zvec-beta)
        indx <- which((abs(beta)<ntol) & t(pvec!=0))
        Zvec[indx] <- t(indx)
        pvec[indx] <- matrix(0,1,length(indx))
        pp <- which(pvec>0)
        zz <- which(Zvec>0)
        nzz <- length(zz)
        if(length(pp)>0){
          zvec[pp] <- smpower(XtX[pp,pp],-1)%*%Xty[pp]
        }
        zvec[zz] <- matrix(0,nzz,1)      
        
      } # end inner loop
      
      beta <- zvec
      wvec <- Xty - XtX%*%beta
      
    } # end outer loop
    
    beta
    
  }

but in my tests it is way slower than the plain nnls function in the nnls package (coded in Fortran), even though algorithmically fnnls should be faster. I was wondering if anyone happens to have an Rcpp port of fnnls available, ideally using Armadillo classes, allowing X to be sparse, and perhaps also supporting Y with multiple columns? Or some other nnls implementation that works with sparse covariate matrices and multiple right-hand sides?
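
For reference, here's a minimal sketch of the kind of comparison I ran (random data purely for illustration; note that fnnls works on the normal equations, while nnls takes X and y directly):

library(multiway)
library(nnls)
set.seed(123)
X <- matrix(runif(200 * 50), 200, 50)
y <- rnorm(200)

# fnnls solves the normal equations t(X) %*% X %*% b = t(X) %*% y, b >= 0
b1 <- fnnls(crossprod(X), crossprod(X, y))

# nnls (Lawson/Hanson, Fortran) works from X and y directly
b2 <- nnls(X, y)$x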


Solution

  • EDIT Jan 2022: Use the nnls function in the RcppML R package on CRAN. This is an even faster Eigen-based implementation of the function given below, followed by a coordinate descent least squares refinement; a usage sketch follows.
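
    A minimal usage sketch of that package function (hedged: I'm assuming the two-argument form nnls(a, b) here, where a is the Gram matrix and b the right-hand side(s); check the package documentation for the full signature):

    # assumed interface: RcppML::nnls(a, b) with a = t(X) %*% X and
    # b = t(X) %*% y; b may have multiple columns (multiple right-hand sides)
    library(RcppML)
    set.seed(1)
    X <- matrix(rnorm(100 * 50), 100, 50)
    y <- rnorm(100)
    x <- RcppML::nnls(crossprod(X), crossprod(X, y))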


    I've spent almost a week on this problem, for research purposes.

    I've also spent almost two days trying to parse through the multiway::fnnls implementation, and will refrain from exercising choice words on R etiquette, interpretability, and memory usage.

    I fail to understand why the authors of multiway::fnnls think their implementation should be fast at all. The R-only implementation seems useless given the Fortran Lawson/Hanson implementation.

    Here's an RcppArmadillo function I wrote, FAST (Fast Approximate Solution Trajectory) NNLS, which replicates multiway::fnnls for well-conditioned systems:

    //[[Rcpp::depends(RcppArmadillo)]]
    #include <RcppArmadillo.h>
    
    using namespace arma;
    
    // a: symmetric positive definite Gram matrix, t(X) %*% X
    // b: right-hand side, t(X) %*% y
    // [[Rcpp::export]]
    vec fastnnls(mat a, vec b) {
    
      // initial x is the unconstrained least squares solution
      vec x = arma::solve(a, b, arma::solve_opts::likely_sympd + arma::solve_opts::fast);
      
      while (any(x < 0)) {
    
        // define the feasible set as all values greater than 0
        arma::uvec nz = find(x > 0);
        
        // reset x
        x.zeros();
        
        // solve the least squares solution restricted to the feasible set
        x.elem(nz) = solve(a.submat(nz, nz), b.elem(nz), arma::solve_opts::likely_sympd + arma::solve_opts::fast);
      }
      return x;
    }
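
    To call it from R, a quick sketch (the filename fastnnls.cpp is just for illustration; note the function expects the normal equations, not X and y):

    library(Rcpp)
    sourceCpp("fastnnls.cpp")  # compiles and exports fastnnls() into R
    set.seed(1)
    X <- matrix(rnorm(100 * 50), 100, 50)
    y <- rnorm(100)
    x <- fastnnls(crossprod(X), crossprod(X, y))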
    

    This approach is essentially the first half of TNT-NN, but without the "heuristics" at each iteration trying to add or remove elements from the feasible set.

    To move this approach beyond a simple approximation, we can add sequential coordinate descent that takes the FAST solution above as its initialization (a rough sketch follows). For small, well-conditioned problems, FAST gives the exact solution 99% of the time.
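
    Here is a minimal R sketch of that refinement step (scd_refine is a hypothetical helper name, not the packaged implementation; it cyclically updates one coefficient at a time on the normal equations, clipping at zero):

    scd_refine <- function(XtX, Xty, x, maxit = 100, tol = 1e-8) {
      for (it in seq_len(maxit)) {
        delta <- 0
        for (j in seq_along(x)) {
          # gradient of 0.5 * t(x) %*% XtX %*% x - t(Xty) %*% x in coordinate j
          grad <- sum(XtX[j, ] * x) - Xty[j]
          xj <- max(0, x[j] - grad / XtX[j, j])
          delta <- max(delta, abs(xj - x[j]))
          x[j] <- xj
        }
        if (delta < tol) break  # stop once no coordinate moves appreciably
      }
      x
    }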

    A unique property of the above implementation is that it cannot give false positives, but in large or ill-conditioned systems it sometimes gives false negatives, so the result may be slightly sparser than the exact solution. Note that the loss achieved by FAST is generally within 1% of that of the exact solution, so if you're not after the absolutely exact solution this is your best bet.

    The above algorithm also runs much faster than the Lawson/Hanson nnls solver. Here's a microbenchmark I just copied over from a 50-coefficient system, replicated 10000 times:

    Unit: microseconds
                   expr   min    lq      mean median    uq     max neval
               fastnnls  53.9  56.2  59.32761   58.0  59.5   359.7 10000
     lawson/hanson nnls 112.9 116.7 125.96169  118.6 129.5 11032.4 10000
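
    The exact benchmark data isn't reproduced here, but a comparable setup could look like this (assumes fastnnls has already been compiled with Rcpp::sourceCpp as sketched above):

    library(microbenchmark)
    library(nnls)
    set.seed(1)
    X <- matrix(rnorm(100 * 50), 100, 50)
    y <- rnorm(100)
    XtX <- crossprod(X)
    Xty <- crossprod(X, y)
    microbenchmark(
      fastnnls = fastnnls(XtX, Xty),
      `lawson/hanson nnls` = nnls(X, y)$x,
      times = 10000
    )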
    

    Of course, performance varies with the density of the problem and how many coefficients end up constrained at zero. Compared to other algorithms, mine tends to get faster with increasing sparsity and with fewer positive coefficients in the solution.

    I have tried and failed to simplify the multiway::fnnls code and port it to Armadillo.

    I am working to implement this method as an Rcpp package and will post an update when it makes it to a stable GitHub release.

    p.s.: it is possible to make this faster using Eigen. The Armadillo solver uses a Cholesky decomposition followed by direct substitution; Eigen's Cholesky solver is faster because it does more operations in-place.