Tags: r, time-series, lag, lm

Transforming a data frame into a time series for dynamic lm estimation


I have imported some variables into a data frame in order to perform basic regressions and statistical analysis. Starting from the time series of these variables I built up my data frame and attached a Date variable to it, to have a clear time reference when plotting. The data frame looks broadly like this (just a random slice):

     time        ffr      cpi          gap
266 2013-04-01    0.12   0.75         -4.17
267 2013-07-01    0.09   1.90         -3.85
268 2013-10-01    0.09   1.28         -3.34
269 2014-01-01    0.07   1.32         -3.94
270 2014-04-01    0.09   1.98         -3.24
271 2014-07-01    0.09   1.31         -2.60
272 2014-10-01    0.10  -0.02         -2.47
273 2015-01-01    0.11  -0.06         -2.68
274 2015-04-01    0.12   2.02         -2.10
275 2015-07-01    0.13   1.24         -1.98
276 2015-10-01    0.16   0.78         -2.11

Now, when I run a simple regression like

reg1 <- lm(ffr ~ cpi + gap, data = df)

everything works fine with the expected results. But when I try a slightly more sophisticated model with an autoregressive part, lags and forwards, things get quite messy, and the solutions I found on the Web do not seem to work in my case. Below are some examples:

reg2 <- lm(ffr ~ cpi + gap + lag(ffr), data = df)

this gives a perfect fit, because what actually happens is that ffr is regressed on itself without any lag. Following what I found elsewhere, I then turned the data frame into time-series format with

df<-xts(df, order.by=df$time)

and then

reg3 <- lm(ffr ~ cpi + gap + lag(ffr), data = df)

which gives very strange results, since it appears -- as far as I understand -- that every observation of cpi, gap and ffr is treated as a separate variable. Here is the output of the regression:

Call:
lm(formula = ffr ~ cpi + gap + lag(ffr), data = small2)

Residuals:
ALL 11 residuals are 0: no residual degrees of freedom!

Coefficients: (16 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)         3         NA      NA       NA
cpi-0.06            1         NA      NA       NA
cpi 0.75            2         NA      NA       NA
cpi 0.78            4         NA      NA       NA
cpi 1.24            3         NA      NA       NA
cpi 1.28           -1         NA      NA       NA
cpi 1.31           -1         NA      NA       NA
cpi 1.32           -2         NA      NA       NA
cpi 1.90           -1         NA      NA       NA
cpi 1.98           -1         NA      NA       NA
cpi 2.02            2         NA      NA       NA
gap-2.10           NA         NA      NA       NA
gap-2.11           NA         NA      NA       NA
gap-2.47           NA         NA      NA       NA
gap-2.60           NA         NA      NA       NA
gap-2.68           NA         NA      NA       NA
gap-3.24           NA         NA      NA       NA
gap-3.34           NA         NA      NA       NA
gap-3.85           NA         NA      NA       NA
gap-3.94           NA         NA      NA       NA
gap-4.17           NA         NA      NA       NA
lag(ffr)0.09       NA         NA      NA       NA
lag(ffr)0.10       NA         NA      NA       NA
lag(ffr)0.11       NA         NA      NA       NA
lag(ffr)0.12       NA         NA      NA       NA
lag(ffr)0.13       NA         NA      NA       NA
lag(ffr)0.16       NA         NA      NA       NA

Residual standard error: NA on 0 degrees of freedom
Multiple R-squared:     NA, Adjusted R-squared:     NA 
F-statistic:    NA on 10 and 0 DF,  p-value: NA

and the following warnings

Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
3: In Ops.factor(r, 2) : ‘^’ not meaningful for factors

The same happens when using zoo instead of xts. I then tried the dyn package, with the data both as a data frame and as an xts/zoo object: nothing works, and I get a perfect fit and the usual errors, respectively. Using the package dynlm changes nothing either. Any hints or ideas about what is going on?

Ah, after transforming the original data frame to xts, it looks like this:

           time         ffr    cpi     gap    
2013-04-01 "2013-04-01" "0.12" " 0.75" "-4.17"
2013-07-01 "2013-07-01" "0.09" " 1.90" "-3.85"
2013-10-01 "2013-10-01" "0.09" " 1.28" "-3.34"
2014-01-01 "2014-01-01" "0.07" " 1.32" "-3.94"
2014-04-01 "2014-04-01" "0.09" " 1.98" "-3.24"
2014-07-01 "2014-07-01" "0.09" " 1.31" "-2.60"
2014-10-01 "2014-10-01" "0.10" "-0.02" "-2.47"
2015-01-01 "2015-01-01" "0.11" "-0.06" "-2.68"
2015-04-01 "2015-04-01" "0.12" " 2.02" "-2.10"
2015-07-01 "2015-07-01" "0.13" " 1.24" "-1.98"
2015-10-01 "2015-10-01" "0.16" " 0.78" "-2.11"

So I wonder if the whole problem is that the transformation fails to convert the data frame properly: every entry, including the numeric values, has become a quoted string.


Solution

  • You could simply calculate the lag yourself, using shift from the data.table package to add a new column to your dataframe:

    library(data.table)  # shift() comes from data.table
    df$lag1 <- shift(df$ffr)
    reg3 <- lm(ffr ~ cpi + gap + lag1, data = df)
    

    Result using your 11 rows:

    > summary(reg3)
    
    Call:
    lm(formula = ffr ~ cpi + gap + lag1, data = df)
    
    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.012355 -0.006234 -0.004345  0.003007  0.019277 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  0.0983353  0.0362563   2.712   0.0350 *
    cpi         -0.0009486  0.0058926  -0.161   0.8774  
    gap          0.0215892  0.0066774   3.233   0.0178 *
    lag1         0.6821619  0.2476126   2.755   0.0331 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.01254 on 6 degrees of freedom
      (1 observation deleted due to missingness)
    Multiple R-squared:  0.844, Adjusted R-squared:  0.7659 
    F-statistic: 10.82 on 3 and 6 DF,  p-value: 0.007808
    

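If you prefer not to add a data.table dependency, the same lagged column can be built in base R. A minimal sketch on the rows from the question (`reg3b` and `lag1` are just illustrative names):

```r
# Rebuild the sample data frame from the question
df <- data.frame(
  ffr = c(0.12, 0.09, 0.09, 0.07, 0.09, 0.09, 0.10, 0.11, 0.12, 0.13, 0.16),
  cpi = c(0.75, 1.90, 1.28, 1.32, 1.98, 1.31, -0.02, -0.06, 2.02, 1.24, 0.78),
  gap = c(-4.17, -3.85, -3.34, -3.94, -3.24, -2.60, -2.47, -2.68, -2.10, -1.98, -2.11)
)

# One-period lag without data.table: prepend NA and drop the last value
df$lag1 <- c(NA, head(df$ffr, -1))

# lm() silently drops the first row, whose lag is NA
reg3b <- lm(ffr ~ cpi + gap + lag1, data = df)
```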
    Alternatively, converting to time series and using dynlm:

    dft <- as.ts(df)
    library(dynlm)
    reg4 <- dynlm(ffr ~ cpi + gap + L(ffr,1), dft)
    

    Results:

    > summary(reg4)
    
    Time series regression with "ts" data:
    Start = 2, End = 11
    
    Call:
    dynlm(formula = ffr ~ cpi + gap + L(ffr, 1), data = dft)
    
    Residuals:
          Min        1Q    Median        3Q       Max 
    -0.012355 -0.006234 -0.004345  0.003007  0.019277 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  0.0983353  0.0362563   2.712   0.0350 *
    cpi         -0.0009486  0.0058926  -0.161   0.8774  
    gap          0.0215892  0.0066774   3.233   0.0178 *
    L(ffr, 1)    0.6821619  0.2476126   2.755   0.0331 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.01254 on 6 degrees of freedom
    Multiple R-squared:  0.844, Adjusted R-squared:  0.7659 
    F-statistic: 10.82 on 3 and 6 DF,  p-value: 0.007808
    
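The question also asks about forwards. If I read the dynlm documentation correctly, L() accepts a negative order for leads, so a one-period-ahead term can be included too. A sketch under that assumption (`reg5` is an illustrative name; the model loses one observation at each end of the sample):

```r
library(dynlm)

# Same numbers as in the question, as a plain ts object
dft <- ts(data.frame(
  ffr = c(0.12, 0.09, 0.09, 0.07, 0.09, 0.09, 0.10, 0.11, 0.12, 0.13, 0.16),
  cpi = c(0.75, 1.90, 1.28, 1.32, 1.98, 1.31, -0.02, -0.06, 2.02, 1.24, 0.78),
  gap = c(-4.17, -3.85, -3.34, -3.94, -3.24, -2.60, -2.47, -2.68, -2.10, -1.98, -2.11)
))

# L(x, 1) is the one-period lag, L(x, -1) the one-period lead
reg5 <- dynlm(ffr ~ cpi + gap + L(ffr, 1) + L(ffr, -1), data = dft)
```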



    EDIT after comments: some clarifications on why lag did not work.

    Maybe you will see more clearly how lag works on a time series with this toy example, in which the series has proper time values:

    > test <- ts(rnorm(48), start=c(2012), frequency=12)
                 Jan         Feb         Mar         Apr         May         Jun         Jul         Aug         Sep         Oct
    2012  0.55388567 -1.44187059 -1.81896266 -1.44285425 -1.37991005 -0.49844787 -1.26719606 -0.49876644  1.89507307 -0.74584888
    2013  1.55083914  0.15779179  0.58075346  0.90677437  0.31632688 -0.20882555  0.05336465 -0.22241098 -0.11031220  0.12591051
    2014  1.49442765  1.87654149 -1.18599539  1.72865701 -0.90245650  0.19460586  0.16168719  0.16245094  1.30435313  1.27952402
    2015  0.53370893 -0.74539203 -0.47584512  0.19720682 -1.50906070 -0.21765018  1.03436621 -0.42588233 -0.15680010 -1.46725844
                 Nov         Dec
    2012  0.64720686 -0.88955517
    2013  0.53687326 -0.04852013
    2014  0.02273335  0.33675748
    2015 -0.24954432 -0.89610509
    > lag(test)
                 Jan         Feb         Mar         Apr         May         Jun         Jul         Aug         Sep         Oct
    2011                                                                                                                        
    2012 -1.44187059 -1.81896266 -1.44285425 -1.37991005 -0.49844787 -1.26719606 -0.49876644  1.89507307 -0.74584888  0.64720686
    2013  0.15779179  0.58075346  0.90677437  0.31632688 -0.20882555  0.05336465 -0.22241098 -0.11031220  0.12591051  0.53687326
    2014  1.87654149 -1.18599539  1.72865701 -0.90245650  0.19460586  0.16168719  0.16245094  1.30435313  1.27952402  0.02273335
    2015 -0.74539203 -0.47584512  0.19720682 -1.50906070 -0.21765018  1.03436621 -0.42588233 -0.15680010 -1.46725844 -0.24954432
                 Nov         Dec
    2011              0.55388567
    2012 -0.88955517  1.55083914
    2013 -0.04852013  1.49442765
    2014  0.33675748  0.53370893
    2015 -0.89610509
    

    The function does not really change the values in the column; it shifts the time indexes with which they are associated. However, doing the same with the "normal" dataframe from your example:

    > df$ffr
     [1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
    > lag(df$ffr)
     [1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
    attr(,"tsp")
    [1]  0 10  1
    

    You can see that even though it is not a time series, lag adds a tsp attribute to it (see ?tsp), but neither the values nor the indexes change, and that is why you get a perfect fit when you use it inside lm.

    On the other hand, if you do it with the dataframe converted to time series,

    > dft[,2]
    Time Series:
    Start = 1 
    End = 11 
    Frequency = 1 
     [1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
    > lag(dft[,2])
    Time Series:
    Start = 0 
    End = 10 
    Frequency = 1 
     [1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
    

    again it changes the metadata but not the values or the indexes, and lm does not see any difference.
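As for the all-quoted xts printout in the question: xts (like zoo) stores its data as a matrix, so converting a data frame that still contains the Date column coerces every column to character, and lm then treats those columns as factors (hence the "not meaningful for factors" warnings). A sketch of a conversion that keeps the data numeric, using the time column only as the index (`dfx` is an illustrative name):

```r
library(xts)

df <- data.frame(
  time = as.Date(c("2013-04-01", "2013-07-01", "2013-10-01")),
  ffr  = c(0.12, 0.09, 0.09),
  cpi  = c(0.75, 1.90, 1.28),
  gap  = c(-4.17, -3.85, -3.34)
)

# Drop the Date column from the data part; use it only for ordering
dfx <- xts(df[, c("ffr", "cpi", "gap")], order.by = df$time)
```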

    As a side point, you can choose the lag order when using shift through its second argument, n, which defaults to 1; see ?shift.
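For example, on a toy vector (`x` is just an illustrative name; assumes data.table is installed):

```r
library(data.table)

x <- c(0.12, 0.09, 0.09, 0.07, 0.09)
shift(x)                        # default n = 1: NA 0.12 0.09 0.09 0.07
shift(x, n = 2)                 # two-period lag: NA NA 0.12 0.09 0.09
shift(x, n = 1, type = "lead")  # lead instead:   0.09 0.09 0.07 0.09 NA
```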

    Hope it helps.