
Using Rollapply to return both the Coefficient and RSquare

I have a dataset that looks something like this:


I would like to calculate the rolling regression coefficient and R-squared over the last 10 items:

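library(data.table)
library(zoo)
dtset[,coefficient:=rollapply(1:20,width=10,FUN=function(a) {
  subdtset <- dtset[a]
  # design matrix is [x, 1], so the first coefficient is the slope
  reg <- lm.fit(matrix(c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
  return(reg$coefficients[[1]])
}, fill=NA, align="right")]
dtset[,rsquare:=rollapply(1:20,width=10,FUN=function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(c(subdtset$x, rep(1,nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
  return(1 - sum((subdtset$y - reg$fitted.values)^2) / sum((subdtset$y - mean(subdtset$y, na.rm=TRUE))^2))
}, fill=NA, align="right")]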

The code above accomplishes this, but my dataset has millions of rows and I have multiple columns where I want to make these calculations so it is taking a very long time. I am hoping there is a way to speed things up:

  1. Is there a better way to capture the last 10 items in rollapply rather than passing the row numbers as the variable a and then doing subdtset <- dtset[a]? I tried using .SD and .SDcols but was unable to get that to work. I can only figure out how to get rollapply to accept one column or vector as the input, not two columns/vectors.
  2. Is there a way to return 2 values from one rollapply statement? I think I could get significant time savings if I only had to do the regression once, and then from that take the coefficient and calculate RSquare. It's pretty inefficient to do the same calculations twice.

Thanks for the help!


  • Use by.column = FALSE to pass both columns to the function. In the function, calculate the slope and R squared directly to avoid the overhead of a full regression fit. Note that rollapply can return a vector, and that rollapplyr (with an r on the end) is right-aligned. This also works if dtset consists of a single x column followed by multiple y columns, as in the example below with the built-in anscombe data frame.

    library(zoo)
    # X is one window of rows: column 1 is x, the remaining columns are y
    stats <- function(X, x = X[, 1], y = X[, -1]) {
      c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
    }
    rollapplyr(dtset, 10, stats, by.column = FALSE, fill = NA)

    a <- anscombe[c("x3", "y1", "y2", "y3")]
    rollapplyr(a, 3, stats, by.column = FALSE, fill = NA)
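
    To write both results back into the table from a single rolling pass, something along these lines should work. This is only a sketch: the question's data is not shown, so the random x and y columns here are a made-up stand-in, and the column names coefficient and rsquare are taken from the question's code.

    ```r
    library(data.table)
    library(zoo)

    # Hypothetical stand-in for the question's dtset (the real data is not shown)
    set.seed(1)
    dtset <- data.table(x = rnorm(20), y = rnorm(20))

    # Slope and R squared from moments; column 1 of the window is x, column 2 is y
    stats <- function(X, x = X[, 1], y = X[, -1]) {
      c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
    }

    # One pass over the rows; each window of 10 rows yields two values.
    # fill = NA pads the first 9 rows so the result has one row per input row.
    res <- rollapplyr(as.matrix(dtset[, .(x, y)]), 10, stats,
                      by.column = FALSE, fill = NA)

    # Assign both columns by reference in one step
    dtset[, c("coefficient", "rsquare") := list(res[, 1], res[, 2])]
    ```

    The regression quantities are computed once per window, and both columns are filled from the same result matrix.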


    We check the formulas using the built-in BOD data frame.

    fm <- lm(demand ~ Time, BOD)
    c(coef(fm)[[2]], summary(fm)$r.squared)
    ## [1] 1.7214286 0.6449202

    stats(BOD)
    ##     slope       rsq 
    ## 1.7214286 0.6449202
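
    For reference, the identities that stats relies on are the standard results for least squares with one predictor and an intercept. They are stated here as background and are not part of the original answer:

    ```latex
    \hat{b} = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)},
    \qquad
    R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
        = \operatorname{cor}(x, y)^2
    ```

    The second equality holds only for this single-predictor case with an intercept, which matches the model fitted in the question.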