Search code examples
rfor-looplmsapplyspline

Is there an alternative to for-loop to fit a spline with a matrix as dependent variable?


I have a matrix with ~ 30 000 rows and 15 columns, and a vector of size 1x15. I want to fit a spline with each row in the matrix as dependent variable, and the vector as a predictor. For each spline, I want to do a prediction with a single x-value, and add all predictions into a vector.

Is there any way to skip a for-loop to solve this and decrease the time complexity?

Here is some example data:

Matrix:

1 0.9999866 0.9999833 0.9999822 0.9998178 0.9996189 0.9994455 0.007492490 0.007492490 0.007492195 0.007464383 0.0003291809 0.0003291808 0.00002728396 0.000017999925
1 0.9997588 0.9990516 0.9990033 0.9959569 0.9942259 0.9920646 0.063989436 0.063989428 0.063980612 0.063502466 0.0052701181 0.0052700809 0.00079669065 0.000497011826
1 0.9882412 0.7925734 0.7920651 0.7890917 0.7312206 0.7283561 0.424428825 0.423345436 0.422478875 0.409804031 0.2936134533 0.2902640241 0.13727615950 0.085531730428

Vector:

0.000    5689.072   11915.687   19188.547   27796.767   37742.035   49564.349   64430.295   84381.754  111870.835  149611.382  221043.651  362982.876  583956.304 1120546.126

My current solution is to loop over the rows of the matrix, append each row and the vector to a temporary data frame, fit the spline and make predictions. I have also tried sapply, with very limited improvement in time complexity.

The for-loop solution:

library(splines)
beta = function(matrix, vector){
  predictions = c()
  for (i in 1:nrow(matrix)){
    # Temporary data frame
    temp_df = data.frame(y = matrix[i,], x = vector)
    
    # Fit a spline for each observation
    
    spline = lm(y ~ ns(x, df = 7), data = temp_df)
    # Predict a value and add to vector
    predictions = c(predictions, predict(spline, data.frame(x = 10000)))
  }
  return(predictions)
}
system.time(beta(matrix, vector)) # user 62.89

The sapply solution:

fun = function(i){
  predictions = c()
  temp_df = data.frame(y = matrix[i, ], x = vector) 
  predictions = c(predictions, predict(lm(y ~ ns(x, df = 7), data = temp_df), data.frame(x = 10000)))
  return(predictions)
}
system.time(sapply(1:nrow(matrix), fun)) # user: 53.42

The type of spline is not crucial for my solution, neither are the package used. I have tried to fit the spline directly for all rows of the matrix at the same time, without success.

I need to be able to expand the matrix to around 1 500 000 x 15, and do several different predictions on short notice. Would really appreciate some help on this.

Thanks in advance!


Solution

  • You could transpose your 30000x15 matrix to 15x30000 and do a multivariate linear model on it, then then predictions would all happen in a single line.

    mat <- read.table(text="1 0.9999866 0.9999833 0.9999822 0.9998178 0.9996189 0.9994455 0.007492490 0.007492490 0.007492195 0.007464383 0.0003291809 0.0003291808 0.00002728396 0.000017999925
    1 0.9997588 0.9990516 0.9990033 0.9959569 0.9942259 0.9920646 0.063989436 0.063989428 0.063980612 0.063502466 0.0052701181 0.0052700809 0.00079669065 0.000497011826
    1 0.9882412 0.7925734 0.7920651 0.7890917 0.7312206 0.7283561 0.424428825 0.423345436 0.422478875 0.409804031 0.2936134533 0.2902640241 0.13727615950 0.085531730428")
    
    vec <- c(0.000, 5689.072, 11915.687, 19188.547, 27796.767, 37742.035, 49564.349, 64430.295, 84381.754, 111870.835, 149611.382, 221043.651, 362982.876, 583956.304, 1120546.126)
    mat <- t(mat)
    colnames(mat) <- paste0("y", 1:ncol(mat))
    dat <- cbind(as.data.frame(mat), x=vec)
    
    form <- paste0("cbind(", paste(colnames(mat), collapse=", "), ") ~ ns(x, df=7)")
    
    library(splines)
    mod <- lm(form, data=dat)
    coef(mod)
    #>                        y1         y2         y3
    #> (Intercept)     1.0320976  1.0303319  1.0356140
    #> ns(x, df = 7)1  0.2965524  0.2774531 -0.1627959
    #> ns(x, df = 7)2 -0.6004987 -0.5797548 -0.5066507
    #> ns(x, df = 7)3 -1.2040751 -1.1148353 -0.6408537
    #> ns(x, df = 7)4 -0.9342747 -0.9346508 -0.6687842
    #> ns(x, df = 7)5 -1.0382844 -1.0351010 -0.8216596
    #> ns(x, df = 7)6 -1.2107326 -1.1990489 -1.1924609
    #> ns(x, df = 7)7 -0.9446160 -0.9469103 -0.8121099
    
    predict(mod, newdata=data.frame(x=10000))
    #>          y1        y2        y3
    #> 1 0.9415758 0.9442323 0.8442096
    

    Created on 2023-04-01 with reprex v2.0.2

    On my machine (MacBook Pro, M1 Max), I get about 27 seconds for the for-loop and 15 seconds for the multivariate linear model on a 30k x 15 matrix of values.

    Another alternative would be to just do out the matrix multiplication to make the coefficients and then generate the predictions from the coefficients.

    matfun <- function(matrix, vector){
      require(Matrix)
      n <- ns(vector, df=7)
      X <- model.matrix(~ns(vector, df=7)) 
      z <- solve(t(X) %*% X) %*% t(X)
      b <- crossprod(t(z), t(matrix))
      predX <- c(1, ns(10000, df=7, knots=attr(n, "knots"), Boundary.knots = attr(n, "Boundary.knots")))
      preds <- t(b) %*% predX
      return(preds)
    }
    

    For me, matfun(mat, vet) for a 30k x 15 matrix took 0.007 seconds - way faster than either of the other alternatives. All three solutions produce predictions that correlate with each other at 1. If all you want is the predictions, the matrix function is the fastest.