I have imported some variables into a dataframe to perform basic regressions and statistical analysis. Starting from the time series of these variables I built up my DF and attached a Date variable to it, to have a clear time reference when plotting. The DF looks broadly like this (a random slice):
time ffr cpi gap
266 2013-04-01 0.12 0.75 -4.17
267 2013-07-01 0.09 1.90 -3.85
268 2013-10-01 0.09 1.28 -3.34
269 2014-01-01 0.07 1.32 -3.94
270 2014-04-01 0.09 1.98 -3.24
271 2014-07-01 0.09 1.31 -2.60
272 2014-10-01 0.10 -0.02 -2.47
273 2015-01-01 0.11 -0.06 -2.68
274 2015-04-01 0.12 2.02 -2.10
275 2015-07-01 0.13 1.24 -1.98
276 2015-10-01 0.16 0.78 -2.11
Now, when I run a simple regression like
reg1 <- lm(ffr ~ cpi + gap, data = df)
everything works fine, with the expected results. But when I try a slightly more sophisticated model with an autoregressive part, lags and leads, things get quite messy, and the solutions I found on the Web do not seem to work in my case. Below are some examples:
reg2 <- lm(ffr ~ cpi + gap + lag(ffr), data = df)
This gives a perfect fit, because what actually happens is that ffr is regressed on itself without any lag. Then, following what I found elsewhere, I turn the dataframe into time-series format with
df <- xts(df, order.by = df$time)
and then
reg3 <- lm(ffr ~ cpi + gap + lag(ffr), data = df)
which gives very strange results: it appears -- in my understanding -- that each observation of cpi, gap and ffr is treated as a separate dummy variable. Here is the output of the regression:
Call:
lm(formula = ffr ~ cpi + gap + lag(ffr), data = df)
Residuals:
ALL 11 residuals are 0: no residual degrees of freedom!
Coefficients: (16 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3 NA NA NA
cpi-0.06 1 NA NA NA
cpi 0.75 2 NA NA NA
cpi 0.78 4 NA NA NA
cpi 1.24 3 NA NA NA
cpi 1.28 -1 NA NA NA
cpi 1.31 -1 NA NA NA
cpi 1.32 -2 NA NA NA
cpi 1.90 -1 NA NA NA
cpi 1.98 -1 NA NA NA
cpi 2.02 2 NA NA NA
gap-2.10 NA NA NA NA
gap-2.11 NA NA NA NA
gap-2.47 NA NA NA NA
gap-2.60 NA NA NA NA
gap-2.68 NA NA NA NA
gap-3.24 NA NA NA NA
gap-3.34 NA NA NA NA
gap-3.85 NA NA NA NA
gap-3.94 NA NA NA NA
gap-4.17 NA NA NA NA
lag(ffr)0.09 NA NA NA NA
lag(ffr)0.10 NA NA NA NA
lag(ffr)0.11 NA NA NA NA
lag(ffr)0.12 NA NA NA NA
lag(ffr)0.13 NA NA NA NA
lag(ffr)0.16 NA NA NA NA
Residual standard error: NA on 0 degrees of freedom
Multiple R-squared: NA, Adjusted R-squared: NA
F-statistic: NA on 10 and 0 DF, p-value: NA
and the following warnings
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
3: In Ops.factor(r, 2) : ‘^’ not meaningful for factors
The same applies when using zoo instead of xts. Then I tried the dyn package, with the data both as a dataframe and as an xts/zoo object: nothing works, and I get a perfect fit and the usual errors, respectively. With the dynlm package, nothing changes. Any hints or ideas about what is going on?
Ah, after transforming the original dataframe into xts, it looks like this:
time ffr cpi gap
2013-04-01 "2013-04-01" "0.12" " 0.75" "-4.17"
2013-07-01 "2013-07-01" "0.09" " 1.90" "-3.85"
2013-10-01 "2013-10-01" "0.09" " 1.28" "-3.34"
2014-01-01 "2014-01-01" "0.07" " 1.32" "-3.94"
2014-04-01 "2014-04-01" "0.09" " 1.98" "-3.24"
2014-07-01 "2014-07-01" "0.09" " 1.31" "-2.60"
2014-10-01 "2014-10-01" "0.10" "-0.02" "-2.47"
2015-01-01 "2015-01-01" "0.11" "-0.06" "-2.68"
2015-04-01 "2015-04-01" "0.12" " 2.02" "-2.10"
2015-07-01 "2015-07-01" "0.13" " 1.24" "-1.98"
2015-10-01 "2015-10-01" "0.16" " 0.78" "-2.11"
So I wonder if the whole problem is that the transformation fails to convert the DF properly -- every value is now quoted, as if coerced to character.
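That is very likely the root of the factor output above: an xts/zoo object is backed by a single matrix, so including the Date column coerces every column to character, and lm then treats each value as a factor level. A minimal sketch of a cleaner conversion (column names taken from your printout):

```r
library(xts)

# Rebuild a small slice of the question's dataframe
df <- data.frame(
  time = as.Date(c("2013-04-01", "2013-07-01", "2013-10-01")),
  ffr  = c(0.12, 0.09, 0.09),
  cpi  = c(0.75, 1.90, 1.28),
  gap  = c(-4.17, -3.85, -3.34)
)

# Passing the whole dataframe coerces everything to character,
# because an xts object stores its data in one matrix:
bad <- xts(df, order.by = df$time)
storage.mode(bad)    # "character"

# Dropping the Date column keeps the data numeric:
good <- xts(df[, c("ffr", "cpi", "gap")], order.by = df$time)
storage.mode(good)   # "double"
```

With the Date column dropped, the numeric columns stay numeric and lm no longer sees factors. That said, you do not need xts at all here, as shown below.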
You could simply calculate the lag yourself, using shift from the data.table package to add a new column to your dataframe:
library(data.table)
df$lag1 <- shift(df$ffr)
reg3 <- lm(ffr ~ cpi + gap + lag1, data = df)
Result using your 11 rows:
> summary(reg3)
Call:
lm(formula = ffr ~ cpi + gap + lag1, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.012355 -0.006234 -0.004345 0.003007 0.019277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0983353 0.0362563 2.712 0.0350 *
cpi -0.0009486 0.0058926 -0.161 0.8774
gap 0.0215892 0.0066774 3.233 0.0178 *
lag1 0.6821619 0.2476126 2.755 0.0331 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01254 on 6 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.844, Adjusted R-squared: 0.7659
F-statistic: 10.82 on 3 and 6 DF, p-value: 0.007808
Alternatively, convert to a time series and use dynlm:
dft <- as.ts(df)
library(dynlm)
reg4 <- dynlm(ffr ~ cpi + gap + L(ffr,1), dft)
Results:
> summary(reg4)
Time series regression with "ts" data:
Start = 2, End = 11
Call:
dynlm(formula = ffr ~ cpi + gap + L(ffr, 1), data = dft)
Residuals:
Min 1Q Median 3Q Max
-0.012355 -0.006234 -0.004345 0.003007 0.019277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0983353 0.0362563 2.712 0.0350 *
cpi -0.0009486 0.0058926 -0.161 0.8774
gap 0.0215892 0.0066774 3.233 0.0178 *
L(ffr, 1) 0.6821619 0.2476126 2.755 0.0331 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01254 on 6 degrees of freedom
Multiple R-squared: 0.844, Adjusted R-squared: 0.7659
F-statistic: 10.82 on 3 and 6 DF, p-value: 0.007808
Hope it helps.
EDIT after comments: some clarifications on why lag did not work.
Maybe you will see more clearly how lag works in a time series with this toy example, in which the series has proper time values:
> test <- ts(rnorm(48), start=c(2012), frequency=12)
> test
Jan Feb Mar Apr May Jun Jul Aug Sep Oct
2012 0.55388567 -1.44187059 -1.81896266 -1.44285425 -1.37991005 -0.49844787 -1.26719606 -0.49876644 1.89507307 -0.74584888
2013 1.55083914 0.15779179 0.58075346 0.90677437 0.31632688 -0.20882555 0.05336465 -0.22241098 -0.11031220 0.12591051
2014 1.49442765 1.87654149 -1.18599539 1.72865701 -0.90245650 0.19460586 0.16168719 0.16245094 1.30435313 1.27952402
2015 0.53370893 -0.74539203 -0.47584512 0.19720682 -1.50906070 -0.21765018 1.03436621 -0.42588233 -0.15680010 -1.46725844
Nov Dec
2012 0.64720686 -0.88955517
2013 0.53687326 -0.04852013
2014 0.02273335 0.33675748
2015 -0.24954432 -0.89610509
> lag(test)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct
2011
2012 -1.44187059 -1.81896266 -1.44285425 -1.37991005 -0.49844787 -1.26719606 -0.49876644 1.89507307 -0.74584888 0.64720686
2013 0.15779179 0.58075346 0.90677437 0.31632688 -0.20882555 0.05336465 -0.22241098 -0.11031220 0.12591051 0.53687326
2014 1.87654149 -1.18599539 1.72865701 -0.90245650 0.19460586 0.16168719 0.16245094 1.30435313 1.27952402 0.02273335
2015 -0.74539203 -0.47584512 0.19720682 -1.50906070 -0.21765018 1.03436621 -0.42588233 -0.15680010 -1.46725844 -0.24954432
Nov Dec
2011 0.55388567
2012 -0.88955517 1.55083914
2013 -0.04852013 1.49442765
2014 0.33675748 0.53370893
2015 -0.89610509
The function is not really changing the column itself, but the time values with which it is associated. However, doing the same with the "normal" dataframe in your example:
> df$ffr
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
> lag(df$ffr)
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
attr(,"tsp")
[1] 0 10 1
You see that even though it is not a time series, lag adds a tsp attribute to it (see ?tsp), but neither the values nor the indexes change, and that is why you get a perfect fit when you use it with lm.
On the other hand, if you do it with the dataframe converted to time series,
> dft[,2]
Time Series:
Start = 1
End = 11
Frequency = 1
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
> lag(dft[,2])
Time Series:
Start = 0
End = 10
Frequency = 1
[1] 0.12 0.09 0.09 0.07 0.09 0.09 0.10 0.11 0.12 0.13 0.16
again it changes the metadata but not the values or the indexes, and lm does not see any difference.
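If you prefer to keep a plain dataframe, note that dplyr::lag (unlike stats::lag) shifts the values themselves, so it can be used directly inside the formula. A sketch, assuming dplyr is installed:

```r
library(dplyr)

x <- c(0.12, 0.09, 0.09, 0.07)
stats::lag(x)   # same values; only a tsp attribute is added
dplyr::lag(x)   # c(NA, 0.12, 0.09, 0.09) -- genuinely shifted

# With your dataframe, being explicit about the namespace matters,
# since stats::lag would otherwise be picked up inside the formula:
# reg2 <- lm(ffr ~ cpi + gap + dplyr::lag(ffr), data = df)
```

lm drops the leading NA row automatically, so this fits the same model as the shift approach above.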
As a side point, you can select the lag order when using shift: its second argument defaults to 1 (see ?shift).
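For instance, sticking with data.table's shift:

```r
library(data.table)

ffr <- c(0.12, 0.09, 0.09, 0.07, 0.09)
shift(ffr)                        # default n = 1: c(NA, 0.12, 0.09, 0.09, 0.07)
shift(ffr, n = 2)                 # two-period lag: c(NA, NA, 0.12, 0.09, 0.09)
shift(ffr, n = 1, type = "lead")  # forward shift: c(0.09, 0.09, 0.07, 0.09, NA)
```

The type = "lead" variant covers the forward terms you mentioned in the question.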
Hope it helps.