I have a very large table in R, with numeric columns giving values across several years. It looks something like this, but with many more columns and several million rows.
id year1 year2 year3 year4
ac 44 34 23 34
as 12 15 25 45
df 99 88 107 114
I would like to calculate a new column giving the trend over time of the numeric columns. I guess this could be the slope of a regression on the numeric columns. How could I calculate this for each row in an efficient manner? Suggestions using data.table
would be ideal.
So far I've been able to do this by re-shaping the data to long, and performing a regression by id
. I wondered if there was a more efficient way to do it without needing to re-shape.
Doing the linear regression manually can handle 1e7 rows and 50 columns in a few seconds:
library(Rfast) # for `rowmeans`
library(data.table)
# example data
set.seed(509437570)
nr <- 1e7 # number of rows
nc <- 50 # number of columns
dt <- data.table(id = 1:nr)[
,paste0("year", 1:nc) :=
as.data.frame(matrix(rnorm(nr*nc, 0, 5) + outer(runif(nr), 1:nc), nr, nc))
]
system.time(
dt[
,c("slope", "intercept") := .(
slope <- (((m <- as.matrix(.SD)) - (ybar <- rowmeans(m))) %*%
(dx <- seq(-(nc - 1)/2, (nc - 1)/2)))/sum(dx^2),
ybar - slope*(nc + 1)/2
), .SDcols = paste0("year", 1:nc)
]
)
#> user system elapsed
#> 4.56 0.96 3.66
dt[1:5, c("id", "slope", "intercept")]
#> id slope intercept
#> 1: 1 0.1821816 2.6856957
#> 2: 2 0.3254113 0.1084359
#> 3: 3 0.4975937 0.4861745
#> 4: 4 0.3796465 1.0722193
#> 5: 5 0.6800638 -0.3818551
Plot the first row with its fit:
plot(1:nc, unlist(dt[1, 2:(nc + 1)]))
abline(dt$intercept[1], dt$slope[1])