I have a time series dataset with two columns: x is hourly continuous temperature data and y is periodically sampled response data (samples taken at 5am, 2pm and 8pm every day) over a couple of weeks.
I would like to use two lag approaches to analyse the data:
1) Plot all my y data (constant) vs increasingly lagged x data (shift the x data by 0-24 hours in 1-hour steps), i.e. x at 6pm vs y at 6pm; x at 5pm vs y at 6pm; ...; x at 5pm on the previous day vs y at 6pm.
2) The same as 1) but with cumulative shifts, i.e. a "backward in time" cumulative lag window of 0:24 with a step of 1 for the x data, tested against the y data, i.e. x at 6pm vs y at 6pm; x (average of 5pm & 6pm) vs y at 6pm; ...; x (average from 6pm back to 5pm on the previous day) vs y at 6pm.
I want to fit a linear model (lm) of y vs shifted x for each lag scenario (0-24) and make a table with columns for the number of lags, the p-value of the lm, and the adjusted R^2 of the lm, so I can see which lag and which cumulative average lag in x best explains the y data.
Essentially it is the same as the "cummean" or "rollapply" functions but working in a backward direction, and I could not find anything in R that does this. Flipping the x data does not work, as the order of the data needs to be maintained because I need the lag in x for several y's.
I would guess it requires a 'for' loop running through all the data at each lag, with "i" being the lag.
A single run with 0 lag looks like this:
# Creating dummy data
library(zoo)

x <- zoo(c(10,10.5,10.5,11,11.5,12,12.5,12,12,12.5,13,12.5,12,12,11.5,10.5), as.Date(1:16))
y <- zoo(c(rep(NA,3),40,rep(NA,3),45,rep(NA,3),50,rep(NA,3),40), as.Date(1:16))
z <- merge(x, y, all = FALSE)
z

reslt <- lm(z$y ~ z$x)
a <- summary(reslt)$coefficients[2, 4]  # p-value of the slope
b <- summary(reslt)$adj.r.squared       # adjusted R^2
ResltTable <- cbind("p-value" = a, "Adj. R^2" = b)
ResltTable
Thanks!
This will regress y against the value of x i periods ago, iterating over i. Note that although the question refers to hourly data, it provides daily dummy data, so we show daily lags. dyn$lm runs lm, adding automatic alignment of the time series. (Note that a new version of dyn was released to CRAN yesterday that addresses changes in the development version of R.) We have run this for lags 0, 1, 2, ..., 10, but if you have more data you could run it up to higher values. If you want to lag in the reverse direction, replace -i with i in lag. If you want to use all lags from 0 to i, use lag(x, 0:-i) and adjust the cbind statement appropriately.
library(dyn) # also loads zoo
x <- zoo(c(10,10.5,10.5,11,11.5,12,12.5,12,12,12.5,13,12.5,12,12,11.5,10.5), as.Date(1:16))
y <- zoo(c(rep(NA,3),40,rep(NA,3),45,rep(NA,3),50,rep(NA,3),40), as.Date(1:16))
z <- merge(x, y, all = FALSE)
z
k <- 10 # highest lag to consider
tab <- t(sapply(0:k, function(i) {
fm <- dyn$lm(y ~ lag(x, -i), z)
s <- summary(fm)
cbind(i, coef(fm)[1], coef(fm)[2], coef(s)[2, 4], s$adj.r.squared)
}))
colnames(tab) <- c("Lag", "Intercept", "Slope", "P Value", "Adj R Sq")
tab
giving:
> tab
Lag Intercept Slope P Value Adj R Sq
[1,] 0 -13.750000 5.0000000 0.04653741 0.8636364
[2,] 1 -2.542373 3.8983051 0.09717103 0.7226502
[3,] 2 -1.944444 3.8888889 0.29647353 0.2424242
[4,] 3 14.651163 2.5581395 0.49421946 -0.1162791
[5,] 4 70.357143 -2.1428571 0.78770438 -0.7857143
[6,] 5 53.571429 -0.7142857 0.87896228 -0.9285714
[7,] 6 58.461538 -1.1538462 0.84557904 -0.8846154
[8,] 7 57.884615 -1.1538462 0.84557904 -0.8846154
[9,] 8 160.000000 -10.0000000 NaN NaN
[10,] 9 102.500000 -5.0000000 NaN NaN
[11,] 10 120.000000 -6.6666667 NaN NaN
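For the second approach in the question (the backward cumulative mean), one option along the same lines is to average x over the current and previous i periods using rollapplyr, the right-aligned form of rollapply, which gives exactly the backward-looking window asked for, and then regress y on that average. This is a sketch rather than part of the answer above, and the names xavg and cumtab are my own:

```r
library(zoo)

# Same dummy data as above
x <- zoo(c(10,10.5,10.5,11,11.5,12,12.5,12,12,12.5,13,12.5,12,12,11.5,10.5), as.Date(1:16))
y <- zoo(c(rep(NA,3),40,rep(NA,3),45,rep(NA,3),50,rep(NA,3),40), as.Date(1:16))

k <- 10  # widest backward window to consider
cumtab <- t(sapply(0:k, function(i) {
  # mean of x at times t-i, ..., t (right-aligned rolling window of width i+1)
  xi <- rollapplyr(x, width = i + 1, FUN = mean)
  # align the averaged x with y and keep only dates present in both
  zi <- merge(xavg = xi, y, all = FALSE)
  fm <- lm(y ~ xavg, data = as.data.frame(zi))
  s <- summary(fm)
  c(Lag = i, Intercept = coef(fm)[[1]], Slope = coef(fm)[[2]],
    "P Value" = coef(s)[2, 4], "Adj R Sq" = s$adj.r.squared)
}))
cumtab
```

At i = 0 the window is just x itself, so the first row of cumtab reproduces the lag-0 row of tab above (intercept -13.75, slope 5).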