I am confused by the R implementation of lag in regression analysis

Look at this linear regression: Y ~ X + lag(X,1). The meaning is very clear: it is a linear regression, and lag(X,1) means the first lag of X. What confuses me is R's implementation of the lag function. In R, lag(X, 1) moves X to the prior time, for example:

>library(zoo)
>x <- c(11, 12, 13, 14)
>str(zoo(x))
‘zoo’ series from 1 to 4 
Data: num [1:4] 11 12 13 14
Index:int [1:4] 1 2 3 4
>lag(zoo(x))
1  2  3
12 13 14

When you regress, which values does R actually use at time 2? I guess R uses the data like this:

time 1   2   3   4
 Y      anything
 X   11  12  13  14
lagX 12  13  14

But this is nonsense! At time 2 (or any specific time) we are supposed to use the first lag of X and the current X, that is 11 and 12, not 13 and 12 as above. The first lag of X should be the prior X, shouldn't it? I am so confused. Please explain this to me, thanks a lot.

Solution

The question starts out with:

Look at this linear regression: Y ~ X + lag(X,1). The meaning is very clear: it is a linear regression, and lag(X,1) means the first lag of X.

Actually that is not the case. It does not refer to this model:

Y[i] = a + b * X[i] + c * X[i-1] + error[i]

It actually refers to this model:

Y[i] = a + b * X[i] + c * X[i+1] + error[i]

which is not likely what you intended.

It is likely that you wanted lag(X, -1) rather than lag(X, 1). Lagging a series in R by a positive k means that the lagged series starts k time units earlier, so at any given time it shows the value that the original series takes k units later.

The other item to be careful of is that lm does not align series. It knows nothing about the time index. You will need to align the series yourself or use a package which does it for you.
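To see what ignoring the time index means in practice, here is a minimal base R sketch. The series x and y are made-up random data, not taken from the question. Because lag.ts changes only the time attributes and not the stored values, lm sees the same column twice:

```r
set.seed(123)
x <- ts(rnorm(10))  # hypothetical regressor
y <- ts(rnorm(10))  # hypothetical response

# The lagged series holds exactly the same values; only its times differ.
same_values <- identical(as.numeric(x), as.numeric(stats::lag(x, -1)))
print(same_values)

# lm drops the time index, so the two regressors are perfectly collinear.
fit <- lm(y ~ x + stats::lag(x, -1))
print(coef(fit))
```

The coefficient on the lagged term comes back as NA because, as far as lm can tell, the two regressors are identical.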

More on these points below.

First let us consider lag.ts from the core of R, since lag.zoo and lag.zooreg are based on it and consistent with it. lag.ts shifts the times of the series so that the lagged series starts earlier. That is, if we have a series whose values are 11, 12, 13 and 14 at times 1, 2, 3 and 4 respectively, lag.ts shifts each time so that the lagged series has the same values 11, 12, 13 and 14 but at times 0, 1, 2 and 3. The original series started at 1 but the lagged series starts at 0. Originally the value 12 was at time 2, but in the lagged series the value 13 is at time 2. In code, we have:

tt <- ts(11:14)
cbind(tt, lag(tt), lag(tt, 1), lag(tt, -1))

gives:

Time Series:
Start = 0 
End = 5 
Frequency = 1 
  tt lag(tt) lag(tt, 1) lag(tt, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14
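The same table can be checked at a single time point. This is a small sketch that calls stats::lag explicitly (so a masking package cannot interfere) and uses window to pick out time 2; the variable names ahead, behind and at_time_2 are only illustrative:

```r
tt <- ts(11:14)

ahead  <- stats::lag(tt, 1)   # times 0..3, so time 2 holds the next value
behind <- stats::lag(tt, -1)  # times 2..5, so time 2 holds the previous value

# Extract the value each series has at time 2.
at_time_2 <- c(
  current = as.numeric(window(tt,     start = 2, end = 2)),
  lag_pos = as.numeric(window(ahead,  start = 2, end = 2)),
  lag_neg = as.numeric(window(behind, start = 2, end = 2))
)
print(at_time_2)
```

At time 2 the positive lag shows 13 (the next value) while the negative lag shows 11 (the previous value), matching row 2 of the table.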

zoo

lag.zoo is consistent with lag.ts. Note that since zoo represents irregularly spaced series, it cannot assume that time 0 comes before time 1. We could only make such an assumption if we knew the series were regularly spaced. Thus if time 1 is the earliest time in a series, the value at this time is dropped, since there is no way to determine what earlier time to lag it to. The lagged series now starts at the second time value of the original series. This is similar to the lag.ts example, except that in the lag.ts example there was a time 0 and in this example there is no such time. Similarly, we cannot extend the time scale forward in time either.

library(zoo)
z <- zoo(11:14)
merge(z, lag(z), lag(z, 1), lag(z,-1))

giving:

   z lag(z) lag(z, 1) lag(z, -1)
1 11     12        12         NA
2 12     13        13         11
3 13     14        14         12
4 14     NA        NA         13

zooreg

The zoo package does have a zooreg class, which assumes a regularly spaced series except for some missing values, so it can deduce what comes before and after just as ts can. With zooreg it can deduce that time 0 comes before and time 5 comes after.

library(zoo)
zr <- zooreg(11:14)
merge(zr, lag(zr), lag(zr, 1), lag(zr,-1))

giving:

  zr lag(zr) lag(zr, 1) lag(zr, -1)
0 NA      11         11          NA
1 11      12         12          NA
2 12      13         13          11
3 13      14         14          12
4 14      NA         NA          13
5 NA      NA         NA          14

lm does not know anything about zoo and will ignore the time index entirely. If you do not want it ignored, i.e. you want to align the series involved prior to running the regression, use the dyn (or dynlm) package. Using the former:

library(dyn)
set.seed(123)
zr <- zooreg(rnorm(10))
y <- 1 + 2 * zr + 3 * lag(zr, -1)
dyn$lm(y ~ zr + lag(zr, -1))

giving:

Call:
lm(formula = dyn(y ~ zr + lag(zr, -1)))

Coefficients:
(Intercept)           zr  lag(zr, -1)  
          1            2            3
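If you would rather not add a package, roughly the same alignment can be done by hand in base R. The sketch below is an alternative to dyn under stated assumptions, not a description of what dyn does internally: it uses ts rather than zooreg, and the names aligned, x and x_prev are made up for illustration.

```r
set.seed(123)
x <- ts(rnorm(10))

# ts.intersect aligns by time and keeps only the times present in every series,
# so x_prev at time t is the value x had at time t - 1.
aligned <- as.data.frame(ts.intersect(x = x, x_prev = stats::lag(x, -1)))

# Build y from the aligned columns so the true coefficients are known.
aligned$y <- 1 + 2 * aligned$x + 3 * aligned$x_prev

# Plain lm is safe here because the alignment was done before the call.
fit <- lm(y ~ x + x_prev, data = aligned)
print(coef(fit))
```

One row is lost to the alignment (9 rows from 10 observations), and the fit recovers the coefficients 1, 2 and 3 used to construct y.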

Note 1: Be sure to read the documentation in the help files: ?lag.ts, ?lag.zoo, ?lag.zooreg and help(package = dyn)

Note 2: If the direction of the lag seems confusing, you could define your own function and use that in place of lag. For example, this gives the same coefficients as the lm output shown above:

Lag <- function(x, k = 1) lag(x, -k)
dyn$lm(y ~ zr + Lag(zr))

An additional word of warning: unlike lag.zoo and lag.zooreg, which are consistent with the core of R, lag.xts from the xts package is inconsistent. The lag in dplyr is also inconsistent (and to make things worse, if you load dplyr then dplyr will mask lag in R with its own inconsistent version of lag). Also note that L in dynlm works the same as Lag but wisely uses a different name to avoid confusion.