Search code examples
rtime-seriesregressionfinance

Run lagged regressions for finance data


I am trying to run lagged regression for each of the 30 assets in my data frame.

My table looks like this:

 date          rx1   rx2   rx3  
    1930-01-31  0     0     0
    1930-02-28  0     0    -0.3
    1930-03-31  0     0    -0.1
    ....              -0.1  ...
    1975-02-28  -0.4  -0.2  ...
    2016-12-31  -0.03  ...  ...

Then I tried to run lagged regressions with this line(dataframe:Rx3.df):

model <- dyn$lm(Rx3.df$rx1 ~ lag(Rx3.df$rx1, 1) + lag(Rx3.df$rx1, 2))

But I kept getting this result that does not make any sense:

Call:
lm(formula = dyn(Rx3.df$rx1 ~ lag(Rx3.df$rx1, 1) + lag(Rx3.df$rx1, 
    2)))

Coefficients:
       (Intercept)  lag(Rx3.df$rx1, 1)  lag(Rx3.df$rx1, 2)  
         3.297e-16           1.000e+00                  NA 

Can anyone figure out the problem? Thanks!


Solution

  • I apologize in advance for the long answer. It seems that for some reason (NO IDEA WHY?) dyn$lm doesn't omit the NAs. If for, example you type summary(model), you'll get this warning message:

    Warning message:
    In summary.lm(lm(rx1 ~ lag(rx1, 1), data = df, model = T)) :
      essentially perfect fit: summary may be unreliable
    

    Also, if you type nobs(model), you will get the the same as nrow(Rx3.df), which shouldn't happen because you'll be dropping at least 1 observation per lag.

    I recreated part of your data as follows:

    > df<- data.frame(rx1 = runif(1000, 1, 100))
    > head(df, 5)
           rx1
    1 56.63239
    2 89.99562
    3 37.35498
    4  7.47771
    5 92.77819
    

    And as in your case, I get:

    > summary(dyn$lm(rx1~lag(rx1, 1) + lag(rx1, 2), data=df))
    
    Call:
    lm(formula = dyn(rx1 ~ lag(rx1, 1) + lag(rx1, 2)), data = df)
    
    Residuals:
           Min         1Q     Median         3Q        Max 
    -3.982e-13 -5.400e-16  3.600e-16  1.230e-15  1.211e-14 
    
    Coefficients: (1 not defined because of singularities)
                 Estimate Std. Error   t value Pr(>|t|)    
    (Intercept) 2.876e-14  8.219e-16 3.499e+01   <2e-16 ***
    lag(rx1, 1) 1.000e+00  1.424e-17 7.024e+16   <2e-16 ***
    lag(rx1, 2)        NA         NA        NA       NA    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.275e-14 on 998 degrees of freedom
    Multiple R-squared:      1, Adjusted R-squared:      1 
    F-statistic: 4.934e+33 on 1 and 998 DF,  p-value: < 2.2e-16
    
    Warning message:
    In summary.lm(dyn$lm(rx1 ~ lag(rx1, 1) + lag(rx1, 2), data = df)) :
      essentially perfect fit: summary may be unreliable
    

    However, if I lag the variables manually:

    > library(dplyr)
    > df_lag<- mutate(df, lag1_rx1 = lag(rx1), lag2_rx1 = lag(rx1, 2))
    > head(df_lag, 5)
           rx1 lag1_rx1 lag2_rx1
    1 56.63239       NA       NA
    2 89.99562 56.63239       NA
    3 37.35498 89.99562 56.63239
    4  7.47771 37.35498 89.99562
    5 92.77819  7.47771 37.35498
    

    The results are correct:

    > summary(dyn$lm(rx1~lag1_rx1+lag2_rx1, data=df_lag))
    
    Call:
    lm(formula = dyn(rx1 ~ lag1_rx1 + lag2_rx1), data = df_lag)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -50.325 -23.271  -0.471  24.763  50.864 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 50.84681    2.46790  20.603   <2e-16 ***
    lag1_rx1    -0.03664    0.03170  -1.156    0.248    
    lag2_rx1     0.02494    0.03170   0.787    0.432    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 28.34 on 995 degrees of freedom
      (2 observations deleted due to missingness)
    Multiple R-squared:  0.002033,  Adjusted R-squared:  2.741e-05 
    F-statistic: 1.014 on 2 and 995 DF,  p-value: 0.3633
    

    Note that in the second case, right under the Residual standard error, I get a notification that (2 observations deleted due to missingness), which should happen! This is also mirrored on the degrees of freedom, which should be equal to the number of observations you have, minus the number of parameters you're estimating.

    I hope this helps!

    MAJOR EDIT for multiple variables You will need to use data.table for this solution. I also created a new df to match the example you gave.

    library(data.table)
    > df<- data.table(date= sample(seq(as.Date('1900/01/01'), as.Date('2000/01/01'), by="day"), 1000), rx1 = runif(1000, 1, 100), rx2 = rnorm(1000), rx3 = rpois(1000, 1))
    > head(df, 5)
             date       rx1         rx2 rx3
    1: 1989-01-16  9.642232 -1.14117351   0
    2: 1964-05-10 55.946293  1.21938225   1
    3: 1911-11-24  8.609234 -0.77489439   1
    4: 1914-09-29 57.253969  0.02277709   0
    5: 1902-08-09 69.454322 -0.10850359   1
    

    The next two pieces of code create the lags:

    > df[, paste0("lag1_", names(df)[like(names(df), "^rx")]) := shift(.SD), .SDcols = like(names(df), "^rx")]
    > df[, paste0("lag2_", names(df)[like(names(df), "^rx")]) := shift(.SD, 2), .SDcols = like(names(df), "^rx")]
    > head(df, 5)
             date       rx1         rx2 rx3  lag1_rx1    lag1_rx2 lag1_rx3  lag2_rx1   lag2_rx2
    1: 1989-01-16  9.642232 -1.14117351   0        NA          NA       NA        NA         NA
    2: 1964-05-10 55.946293  1.21938225   1  9.642232 -1.14117351        0        NA         NA
    3: 1911-11-24  8.609234 -0.77489439   1 55.946293  1.21938225        1  9.642232 -1.1411735
    4: 1914-09-29 57.253969  0.02277709   0  8.609234 -0.77489439        1 55.946293  1.2193823
    5: 1902-08-09 69.454322 -0.10850359   1 57.253969  0.02277709        0  8.609234 -0.7748944
       lag2_rx3
    1:       NA
    2:       NA
    3:        0
    4:        1
    5:        1
    

    And finally the results:

    > formula<- paste0(paste0("rx", 1:3),"~", paste0("lag1_rx", 1:3), "+", paste0("lag2_rx", 1:3))
    > results<- lapply(formula, function(x) lm(x, data=df))
    > names(results)<- paste0("rx", 1:3)
    

    Obviously, change the 3 above to how ever many variables you have. I'm not providing the output because it's too big. For rx1 you can access the results by typing summary(results[["rx1"]].

    PS: It doesn't seem to make a difference if you use dyn$lm compared to regular old lm.