I have imported some variables into a dataframe to perform basic regressions and statistical analysis. Starting from the time series of these variables I built up my DF and attached a Date variable to it, to have a clear time reference when plotting. The DF looks broadly like this (a random slice):
time ffr cpi gap
266 2013-04-01 0.12 0.75 -4.17
267 2013-07-01 0.09 1.90 -3.85
268 2013-10-01 0.09 1.28 -3.34
269 2014-01-01 0.07 1.32 -3.94
270 2014-04-01 0.09 1.98 -3.24
271 2014-07-01 0.09 1.31 -2.60
272 2014-10-01 0.10 -0.02 -2.47
273 2015-01-01 0.11 -0.06 -2.68
274 2015-04-01 0.12 2.02 -2.10
275 2015-07-01 0.13 1.24 -1.98
276 2015-10-01 0.16 0.78 -2.11
Now, when I run a simple regression like
reg1 <- lm(ffr ~ cpi + gap, data = df)
everything works fine, with the expected results. But when I try a slightly more sophisticated model with an autoregressive part, lags and leads, things get quite messy, and the solutions I found on the Web do not seem to work in my case. Below are some examples:
reg2 <- lm(ffr ~ cpi + gap + lag(ffr), data = df)
This gives a perfect fit, because what actually happens is that ffr is regressed on itself without any lag. Then, following what I found elsewhere, I turn the dataframe into time-series format with
df <- xts(df, order.by = df$time)
and then
reg3 <- lm(ffr ~ cpi + gap + lag(ffr), data = df)
which gives very strange results: it appears -- in my understanding -- that each observation of cpi, gap and ffr is treated as a separate dummy variable. Here is the output of the regression:
Call:
lm(formula = ffr ~ cpi + gap + lag(ffr), data = df)
Residuals:
ALL 11 residuals are 0: no residual degrees of freedom!
Coefficients: (16 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3 NA NA NA
cpi-0.06 1 NA NA NA
cpi 0.75 2 NA NA NA
cpi 0.78 4 NA NA NA
cpi 1.24 3 NA NA NA
cpi 1.28 -1 NA NA NA
cpi 1.31 -1 NA NA NA
cpi 1.32 -2 NA NA NA
cpi 1.90 -1 NA NA NA
cpi 1.98 -1 NA NA NA
cpi 2.02 2 NA NA NA
gap-2.10 NA NA NA NA
gap-2.11 NA NA NA NA
gap-2.47 NA NA NA NA
gap-2.60 NA NA NA NA
gap-2.68 NA NA NA NA
gap-3.24 NA NA NA NA
gap-3.34 NA NA NA NA
gap-3.85 NA NA NA NA
gap-3.94 NA NA NA NA
gap-4.17 NA NA NA NA
lag(ffr)0.09 NA NA NA NA
lag(ffr)0.10 NA NA NA NA
lag(ffr)0.11 NA NA NA NA
lag(ffr)0.12 NA NA NA NA
lag(ffr)0.13 NA NA NA NA
lag(ffr)0.16 NA NA NA NA
Residual standard error: NA on 0 degrees of freedom
Multiple R-squared: NA, Adjusted R-squared: NA
F-statistic: NA on 10 and 0 DF, p-value: NA
and the following warnings
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
3: In Ops.factor(r, 2) : ‘^’ not meaningful for factors
The same applies when using zoo instead of xts. Then I tried the dyn package, with the data both as a dataframe and as an xts/zoo object: nothing works, and I get a perfect fit and the usual errors, respectively. With the dynlm package, nothing changes. Any hints or ideas about what is going on?
Ah, after transforming the original dataframe into xts, it looks like this:
time ffr cpi gap
2013-04-01 "2013-04-01" "0.12" " 0.75" "-4.17"
2013-07-01 "2013-07-01" "0.09" " 1.90" "-3.85"
2013-10-01 "2013-10-01" "0.09" " 1.28" "-3.34"
2014-01-01 "2014-01-01" "0.07" " 1.32" "-3.94"
2014-04-01 "2014-04-01" "0.09" " 1.98" "-3.24"
2014-07-01 "2014-07-01" "0.09" " 1.31" "-2.60"
2014-10-01 "2014-10-01" "0.10" "-0.02" "-2.47"
2015-01-01 "2015-01-01" "0.11" "-0.06" "-2.68"
2015-04-01 "2015-04-01" "0.12" " 2.02" "-2.10"
2015-07-01 "2015-07-01" "0.13" " 1.24" "-1.98"
2015-10-01 "2015-10-01" "0.16" " 0.78" "-2.11"
So I wonder if the whole problem is that the transformation fails to convert the DF properly -- every value is now quoted, as if coerced to character.
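That is very likely the root of the factor output above: an xts/zoo object is backed by a single matrix, so including the Date column coerces every column to character, and lm then treats each value as a factor level. A minimal sketch of a cleaner conversion (column names taken from your printout):

```r
library(xts)

# Rebuild a small slice of the question's dataframe
df <- data.frame(
  time = as.Date(c("2013-04-01", "2013-07-01", "2013-10-01")),
  ffr  = c(0.12, 0.09, 0.09),
  cpi  = c(0.75, 1.90, 1.28),
  gap  = c(-4.17, -3.85, -3.34)
)

# Passing the whole dataframe coerces everything to character,
# because an xts object stores its data in one matrix:
bad <- xts(df, order.by = df$time)
storage.mode(bad)    # "character"

# Dropping the Date column keeps the data numeric:
good <- xts(df[, c("ffr", "cpi", "gap")], order.by = df$time)
storage.mode(good)   # "double"
```

With the Date column dropped, the numeric columns stay numeric and lm no longer sees factors. That said, you do not need xts at all here, as shown below.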
You could simply calculate the lag yourself, using shift from the data.table package to add a new column to your dataframe:
library(data.table)
df$lag1 <- shift(df$ffr)
reg3 <- lm(ffr ~ cpi + gap + lag1, data = df)
Result using your 11 rows:
> summary(reg3)
Call:
lm(formula = ffr ~ cpi + gap + lag1, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.012355 -0.006234 -0.004345 0.003007 0.019277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0983353 0.0362563 2.712 0.0350 *
cpi -0.0009486 0.0058926 -0.161 0.8774
gap 0.0215892 0.0066774 3.233 0.0178 *
lag1 0.6821619 0.2476126 2.755 0.0331 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01254 on 6 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.844, Adjusted R-squared: 0.7659
F-statistic: 10.82 on 3 and 6 DF, p-value: 0.007808
Alternatively, convert to a time series and use dynlm:
dft <- as.ts(df)
library(dynlm)
reg4 <- dynlm(ffr ~ cpi + gap + L(ffr,1), dft)
Results:
> summary(reg4)
Time series regression with "ts" data:
Start = 2, End = 11
Call:
dynlm(formula = ffr ~ cpi + gap + L(ffr, 1), data = dft)
Residuals:
Min 1Q Median 3Q Max
-0.012355 -0.006234 -0.004345 0.003007 0.019277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0983353 0.0362563 2.712 0.0350 *
cpi -0.0009486 0.0058926 -0.161 0.8774
gap 0.0215892 0.0066774 3.233 0.0178 *
L(ffr, 1) 0.6821619 0.2476126 2.755 0.0331 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01254 on 6 degrees of freedom
Multiple R-squared: 0.844, Adjusted R-squared: 0.7659
F-statistic: 10.82 on 3 and 6 DF, p-value: 0.007808
Hope it helps.
EDIT after comments: some clarifications on why lag did not work.
Maybe you will see more clearly how lag works in a time series with this toy example, in which the series has proper time values:
> test <- ts(rnorm(48), start=c(2012), frequency=12)
> test
Jan Feb Mar Apr May Jun Jul Aug Sep Oct
2012 0.55388567 -1.44187059 -1.81896266 -1.44285425 -1.37991005 -0.49844787 -1.26719606 -0.49876644 1.89507307 -0.74584888
2013 1.55083914 0.15779179 0.58075346 0.90677437 0.31632688 -0.20882555 0.05336465 -0.22241098 -0.11031220 0.12591051
2014 1.49442765 1.87654149 -1.18599539 1.72865701 -0.90245650 0.19460586 0.16168719 0.16245094 1.30435313 1.27952402
2015 0.53370893 -0.74539203 -0.47584512 0.19720682 -1.50906070 -0.21765018 1.03436621 -0.42588233 -0.15680010 -1.46725844
Nov Dec
2012 0.64720686 -0.88955517
2013 0.53687326 -0.04852013
2014 0.02273335 0.33675748
2015 -0.24954432 -0.89610509
> lag(test)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct
2011
2012 -1.44187059 -1.81896266 -1.44285425 -1.37991005 -0.49844787 -1.26719606 -0.49876644 1.89507307 -0.74584888 0.64720686
2013 0.15779179 0.58075346 0.90677437 0.31632688 -0.20882555 0.05336465 -0.22241098 -0.11031220 0.12591051 0.53687326
2014 1.87654149 -1.18599539 1.72865701 -0.90245650 0.19460586 0.16168719 0.16245094 1.30435313 1.27952402 0.02273335
2015 -0.74539203 -0.47584512 0.19720682 -1.50906070 -0.21765018 1.03436621 -0.42588233 -0.15680010 -1.46725844 -0.24954432
Nov Dec
2011 0.55388567
2012 -0.88955517 1.55083914
2013 -0.04852013 1.49442765
2014 0.33675748 0.53370893
2015 -0.89610509
The function is not really changing the column itself, but the time values with which it is associated. However, doing the same with the "normal" dataframe in your example:
> df$ffr
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
> lag(df$ffr)
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
attr(,"tsp")
[1] 0 10 1
You see that even though it is not a time series, lag adds a tsp attribute to it (see ?tsp), but neither the values nor the indexes change, and that is why you get a perfect fit when you use it with lm.
On the other hand, if you do it with the dataframe converted to time series,
> dft[,2]
Time Series:
Start = 1
End = 11
Frequency = 1
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
> lag(dft[,2])
Time Series:
Start = 0
End = 10
Frequency = 1
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
again it changes the metadata but not the values or the indexes, and lm does not see any difference.
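If you prefer to keep a plain dataframe, note that dplyr::lag (unlike stats::lag) shifts the values themselves, so it can be used directly inside the formula. A sketch, assuming dplyr is installed:

```r
library(dplyr)

x <- c(0.12, 0.09, 0.09, 0.07)
stats::lag(x)   # same values; only a tsp attribute is added
dplyr::lag(x)   # c(NA, 0.12, 0.09, 0.09) -- genuinely shifted

# With your dataframe, being explicit about the namespace matters,
# since stats::lag would otherwise be picked up inside the formula:
# reg2 <- lm(ffr ~ cpi + gap + dplyr::lag(ffr), data = df)
```

lm drops the leading NA row automatically, so this fits the same model as the shift approach above.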
As a side point, you can select the lag order when using shift: its second argument defaults to 1 (see ?shift).
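For instance, sticking with data.table's shift:

```r
library(data.table)

ffr <- c(0.12, 0.09, 0.09, 0.07, 0.09)
shift(ffr)                        # default n = 1: c(NA, 0.12, 0.09, 0.09, 0.07)
shift(ffr, n = 2)                 # two-period lag: c(NA, NA, 0.12, 0.09, 0.09)
shift(ffr, n = 1, type = "lead")  # forward shift: c(0.09, 0.09, 0.07, 0.09, NA)
```

The type = "lead" variant covers the forward terms you mentioned in the question.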
Hope it helps.