I have a matrix with ~ 30 000 rows and 15 columns, and a vector of size 1x15. I want to fit a spline with each row in the matrix as dependent variable, and the vector as a predictor. For each spline, I want to do a prediction with a single x-value, and add all predictions into a vector.
Is there any way to skip a for-loop to solve this and decrease the time complexity?
Here is some example data:
Matrix:
1 0.9999866 0.9999833 0.9999822 0.9998178 0.9996189 0.9994455 0.007492490 0.007492490 0.007492195 0.007464383 0.0003291809 0.0003291808 0.00002728396 0.000017999925
1 0.9997588 0.9990516 0.9990033 0.9959569 0.9942259 0.9920646 0.063989436 0.063989428 0.063980612 0.063502466 0.0052701181 0.0052700809 0.00079669065 0.000497011826
1 0.9882412 0.7925734 0.7920651 0.7890917 0.7312206 0.7283561 0.424428825 0.423345436 0.422478875 0.409804031 0.2936134533 0.2902640241 0.13727615950 0.085531730428
Vector:
0.000 5689.072 11915.687 19188.547 27796.767 37742.035 49564.349 64430.295 84381.754 111870.835 149611.382 221043.651 362982.876 583956.304 1120546.126
My current solution is to loop over the rows of the matrix, append each row and the vector to a temporary data frame, fit the spline and make predictions. I have also tried sapply, with very limited improvement in time complexity.
The for-loop solution:
library(splines)
beta = function(matrix, vector){
predictions = c()
for (i in 1:nrow(matrix)){
# Temporary data frame
temp_df = data.frame(y = matrix[i,], x = vector)
# Fit a spline for each observation
spline = lm(y ~ ns(x, df = 7), data = temp_df)
# Predict a value and add to vector
predictions = c(predictions, predict(spline, data.frame(x = 10000)))
}
return(predictions)
}
system.time(beta(matrix, vector)) # user 62.89
The sapply solution:
fun = function(i){
predictions = c()
temp_df = data.frame(y = matrix[i, ], x = vector)
predictions = c(predictions, predict(lm(y ~ ns(x, df = 7), data = temp_df), data.frame(x = 10000)))
return(predictions)
}
system.time(sapply(1:nrow(matrix), fun)) # user: 53.42
The type of spline is not crucial for my solution, neither are the package used. I have tried to fit the spline directly for all rows of the matrix at the same time, without success.
I need to be able to expand the matrix to around 1 500 000 x 15, and do several different predictions on short notice. Would really appreciate some help on this.
Thanks in advance!
You could transpose your 30000x15 matrix to 15x30000 and do a multivariate linear model on it, then then predictions would all happen in a single line.
mat <- read.table(text="1 0.9999866 0.9999833 0.9999822 0.9998178 0.9996189 0.9994455 0.007492490 0.007492490 0.007492195 0.007464383 0.0003291809 0.0003291808 0.00002728396 0.000017999925
1 0.9997588 0.9990516 0.9990033 0.9959569 0.9942259 0.9920646 0.063989436 0.063989428 0.063980612 0.063502466 0.0052701181 0.0052700809 0.00079669065 0.000497011826
1 0.9882412 0.7925734 0.7920651 0.7890917 0.7312206 0.7283561 0.424428825 0.423345436 0.422478875 0.409804031 0.2936134533 0.2902640241 0.13727615950 0.085531730428")
vec <- c(0.000, 5689.072, 11915.687, 19188.547, 27796.767, 37742.035, 49564.349, 64430.295, 84381.754, 111870.835, 149611.382, 221043.651, 362982.876, 583956.304, 1120546.126)
mat <- t(mat)
colnames(mat) <- paste0("y", 1:ncol(mat))
dat <- cbind(as.data.frame(mat), x=vec)
form <- paste0("cbind(", paste(colnames(mat), collapse=", "), ") ~ ns(x, df=7)")
library(splines)
mod <- lm(form, data=dat)
coef(mod)
#> y1 y2 y3
#> (Intercept) 1.0320976 1.0303319 1.0356140
#> ns(x, df = 7)1 0.2965524 0.2774531 -0.1627959
#> ns(x, df = 7)2 -0.6004987 -0.5797548 -0.5066507
#> ns(x, df = 7)3 -1.2040751 -1.1148353 -0.6408537
#> ns(x, df = 7)4 -0.9342747 -0.9346508 -0.6687842
#> ns(x, df = 7)5 -1.0382844 -1.0351010 -0.8216596
#> ns(x, df = 7)6 -1.2107326 -1.1990489 -1.1924609
#> ns(x, df = 7)7 -0.9446160 -0.9469103 -0.8121099
predict(mod, newdata=data.frame(x=10000))
#> y1 y2 y3
#> 1 0.9415758 0.9442323 0.8442096
Created on 2023-04-01 with reprex v2.0.2
On my machine (MacBook Pro, M1 Max), I get about 27 seconds for the for-loop and 15 seconds for the multivariate linear model on a 30k x 15 matrix of values.
Another alternative would be to just do out the matrix multiplication to make the coefficients and then generate the predictions from the coefficients.
matfun <- function(matrix, vector){
require(Matrix)
n <- ns(vector, df=7)
X <- model.matrix(~ns(vector, df=7))
z <- solve(t(X) %*% X) %*% t(X)
b <- crossprod(t(z), t(matrix))
predX <- c(1, ns(10000, df=7, knots=attr(n, "knots"), Boundary.knots = attr(n, "Boundary.knots")))
preds <- t(b) %*% predX
return(preds)
}
For me, matfun(mat, vet)
for a 30k x 15 matrix took 0.007 seconds - way faster than either of the other alternatives. All three solutions produce predictions that correlate with each other at 1. If all you want is the predictions, the matrix function is the fastest.