Run lagged regressions for finance data

I am trying to run lagged regression for each of the 30 assets in my data frame.

My table looks like this:

 date          rx1   rx2   rx3  
    1930-01-31  0     0     0
    1930-02-28  0     0    -0.3
    1930-03-31  0     0    -0.1
    ....              -0.1  ...
    1975-02-28  -0.4  -0.2  ...
    2016-12-31  -0.03  ...  ...

Then I tried to run lagged regressions with this line(dataframe:Rx3.df):

model <- dyn$lm(Rx3.df$rx1 ~ lag(Rx3.df$rx1, 1) + lag(Rx3.df$rx1, 2))

But I kept getting this result that does not make any sense:

Call:
lm(formula = dyn(Rx3.df$rx1 ~ lag(Rx3.df$rx1, 1) + lag(Rx3.df$rx1, 
    2)))

Coefficients:
       (Intercept)  lag(Rx3.df$rx1, 1)  lag(Rx3.df$rx1, 2)  
         3.297e-16           1.000e+00                  NA

Can anyone figure out the problem? Thanks!

Solution

I apologize in advance for the long answer. It seems that for some reason (NO IDEA WHY?) dyn$lm doesn't omit the NAs. If for, example you type summary(model), you'll get this warning message:

Warning message:
In summary.lm(lm(rx1 ~ lag(rx1, 1), data = df, model = T)) :
  essentially perfect fit: summary may be unreliable

Also, if you type nobs(model), you will get the the same as nrow(Rx3.df), which shouldn't happen because you'll be dropping at least 1 observation per lag.

I recreated part of your data as follows:

> df<- data.frame(rx1 = runif(1000, 1, 100))
> head(df, 5)
       rx1
1 56.63239
2 89.99562
3 37.35498
4  7.47771
5 92.77819

And as in your case, I get:

> summary(dyn$lm(rx1~lag(rx1, 1) + lag(rx1, 2), data=df))

Call:
lm(formula = dyn(rx1 ~ lag(rx1, 1) + lag(rx1, 2)), data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.982e-13 -5.400e-16  3.600e-16  1.230e-15  1.211e-14 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 2.876e-14  8.219e-16 3.499e+01   <2e-16 ***
lag(rx1, 1) 1.000e+00  1.424e-17 7.024e+16   <2e-16 ***
lag(rx1, 2)        NA         NA        NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.275e-14 on 998 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.934e+33 on 1 and 998 DF,  p-value: < 2.2e-16

Warning message:
In summary.lm(dyn$lm(rx1 ~ lag(rx1, 1) + lag(rx1, 2), data = df)) :
  essentially perfect fit: summary may be unreliable

However, if I lag the variables manually:

> library(dplyr)
> df_lag<- mutate(df, lag1_rx1 = lag(rx1), lag2_rx1 = lag(rx1, 2))
> head(df_lag, 5)
       rx1 lag1_rx1 lag2_rx1
1 56.63239       NA       NA
2 89.99562 56.63239       NA
3 37.35498 89.99562 56.63239
4  7.47771 37.35498 89.99562
5 92.77819  7.47771 37.35498

The results are correct:

> summary(dyn$lm(rx1~lag1_rx1+lag2_rx1, data=df_lag))

Call:
lm(formula = dyn(rx1 ~ lag1_rx1 + lag2_rx1), data = df_lag)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.325 -23.271  -0.471  24.763  50.864 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 50.84681    2.46790  20.603   <2e-16 ***
lag1_rx1    -0.03664    0.03170  -1.156    0.248    
lag2_rx1     0.02494    0.03170   0.787    0.432    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.34 on 995 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.002033,  Adjusted R-squared:  2.741e-05 
F-statistic: 1.014 on 2 and 995 DF,  p-value: 0.3633

Note that in the second case, right under the Residual standard error, I get a notification that (2 observations deleted due to missingness), which should happen! This is also mirrored on the degrees of freedom, which should be equal to the number of observations you have, minus the number of parameters you're estimating.

I hope this helps!

MAJOR EDIT for multiple variables You will need to use data.table for this solution. I also created a new df to match the example you gave.

library(data.table)
> df<- data.table(date= sample(seq(as.Date('1900/01/01'), as.Date('2000/01/01'), by="day"), 1000), rx1 = runif(1000, 1, 100), rx2 = rnorm(1000), rx3 = rpois(1000, 1))
> head(df, 5)
         date       rx1         rx2 rx3
1: 1989-01-16  9.642232 -1.14117351   0
2: 1964-05-10 55.946293  1.21938225   1
3: 1911-11-24  8.609234 -0.77489439   1
4: 1914-09-29 57.253969  0.02277709   0
5: 1902-08-09 69.454322 -0.10850359   1

The next two pieces of code create the lags:

> df[, paste0("lag1_", names(df)[like(names(df), "^rx")]) := shift(.SD), .SDcols = like(names(df), "^rx")]
> df[, paste0("lag2_", names(df)[like(names(df), "^rx")]) := shift(.SD, 2), .SDcols = like(names(df), "^rx")]
> head(df, 5)
         date       rx1         rx2 rx3  lag1_rx1    lag1_rx2 lag1_rx3  lag2_rx1   lag2_rx2
1: 1989-01-16  9.642232 -1.14117351   0        NA          NA       NA        NA         NA
2: 1964-05-10 55.946293  1.21938225   1  9.642232 -1.14117351        0        NA         NA
3: 1911-11-24  8.609234 -0.77489439   1 55.946293  1.21938225        1  9.642232 -1.1411735
4: 1914-09-29 57.253969  0.02277709   0  8.609234 -0.77489439        1 55.946293  1.2193823
5: 1902-08-09 69.454322 -0.10850359   1 57.253969  0.02277709        0  8.609234 -0.7748944
   lag2_rx3
1:       NA
2:       NA
3:        0
4:        1
5:        1

And finally the results:

> formula<- paste0(paste0("rx", 1:3),"~", paste0("lag1_rx", 1:3), "+", paste0("lag2_rx", 1:3))
> results<- lapply(formula, function(x) lm(x, data=df))
> names(results)<- paste0("rx", 1:3)

Obviously, change the 3 above to how ever many variables you have. I'm not providing the output because it's too big. For rx1 you can access the results by typing summary(results[["rx1"]].

PS: It doesn't seem to make a difference if you use dyn$lm compared to regular old lm.