Search code examples
rdata.tableregression

Calculate new column giving trend/slope across other columns using data.table R


I have a very large table in R, with numeric columns giving values across several years. It looks something like this, but with many more columns and several million rows.

id    year1   year2   year3   year4
ac       44      34      23      34
as       12      15      25      45
df       99      88     107     114

I would like to calculate a new column giving the trend over time of the numeric columns. I guess this could be the slope of a regression on the numeric columns. How could I calculate this for each row in an efficient manner? Suggestions using data.table would be ideal.

So far I've been able to do this by re-shaping the data to long, and performing a regression by id. I wondered if there was a more efficient way to do it without needing to re-shape.


Solution

  • Doing the linear regression manually can handle 1e7 rows and 50 columns in a few seconds:

    library(Rfast) # for `rowmeans`
    library(data.table)
    
    # example data
    set.seed(509437570)
    nr <- 1e7 # number of rows
    nc <- 50 # number of columns
    dt <- data.table(id = 1:nr)[
      ,paste0("year", 1:nc) :=
        as.data.frame(matrix(rnorm(nr*nc, 0, 5) + outer(runif(nr), 1:nc), nr, nc))
    ]
    
    system.time(
      dt[
        ,c("slope", "intercept") := .(
          slope <- (((m <- as.matrix(.SD)) - (ybar <- rowmeans(m))) %*%
            (dx <- seq(-(nc - 1)/2, (nc - 1)/2)))/sum(dx^2),
          ybar - slope*(nc + 1)/2
        ), .SDcols = paste0("year", 1:nc)
      ]
    )
    #>    user  system elapsed 
    #>    4.56    0.96    3.66
    
    dt[1:5, c("id", "slope", "intercept")]
    #>    id     slope  intercept
    #> 1:  1 0.1821816  2.6856957
    #> 2:  2 0.3254113  0.1084359
    #> 3:  3 0.4975937  0.4861745
    #> 4:  4 0.3796465  1.0722193
    #> 5:  5 0.6800638 -0.3818551
    

    Plot the first row with its fit:

    plot(1:nc, unlist(dt[1, 2:(nc + 1)]))
    abline(dt$intercept[1], dt$slope[1])