Search code examples
rregressionarimarolling-computationforecast

Rolling AR regressions in R


I am managing a pure monthly time seires data with 348 observations.

Here is the reproducible sample data:

library(dplyr)
set.seed(123)

Y <- cumsum(rnorm(48))
date <- as.Date(c("2012-01-01", "2012-02-01", "2012-03-01", "2012-04-01", 
                  "2012-05-01","2012-06-01", "2012-07-01", "2012-08-01",
                  "2012-09-01","2012-10-01","2012-11-01", "2012-12-01",
                  "2013-01-01", "2013-02-01","2013-03-01", "2013-04-01", 
                  "2013-05-01","2013-06-01", "2013-07-01", "2013-08-01",
                  "2013-09-01","2013-10-01","2013-11-01", "2013-12-01",
                  "2014-01-01", "2014-02-01","2014-03-01", "2014-04-01", 
                  "2014-05-01","2014-06-01", "2014-07-01", "2014-08-01",
                  "2014-09-01","2014-10-01","2014-11-01", "2014-12-01",
                  "2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", 
                  "2015-05-01","2015-06-01",  "2015-07-01", "2015-08-01",
                  "2015-09-01","2015-10-01","2015-11-01", "2015-12-01"))

data<-data.frame(date,Y)

I am replicating a paper and calculating "shock". The quoted procedure is as follows:

"Shocks in each of these series are calculated by an AR(2) model over a rolling window of 10 months that ends in month n.The shock in month n+1 denoted dYn+1 is the difference between the actual value of the series and its predicted value using the slope coefficients estimated over the preceding 10 months. Thus, our method is forward-looking, providing out-of-sample prediction errors."

The AR(2) model with a trend term according to original authors is as follows:

Yt = a0 + a1*Yt-1 + a2*Yt-2 + a3*Tt+ residualt
where Tt is the serial number of the observation, to account for a time trend in these series.

If my objective is to calculate the mean in 11st using previous 10 obs, I can simply call the following:

Mean= slide_index_dbl(Y, date, mean, .before = months(10), .after = months(-1), .complete = T)

However, in this case, the objective is to run a AR model using all previous 10 obs and use this estimated model to predict 11st one, the final ouput is the actual value in 11st minus the predicted one. Put it simply, I need to construct a function to achieve this goal instead of using "mean" function in previous example.

Once I finish this function(let's call it AR_2), we can call it inside the slider.

library(slider)

data1<-data%>%
  mutate(Shock= slide_index_dbl(Y, date, AR_2, .before = months(10), .after = months(-1), .complete = T))

   Date       Y     N  Predict    Shock
2012-01-01   0.15   1    0.2     -0.005
2012-02-01   0.4    2    0.33      0.07
2012-03-01   0.39   3    0.44     -0.05
... 
2012-10-01   1.85   10   2.1      -0.25
2012-11-01   1.7    11   1.5       0.2   
2012-12-01   3.46   12   4.1      -0.65

Let me illustrate my question using above sample data I make up. The final output is Shock, which is the difference between Y(actual data) and Predicted Y. With that being said, the question is how to get Predicted Y. Before we predict Y, we need to first train a AR model by feeding previous 10 months data. Once you get this model, you can predict 11st obs with this model and that would be Predicted Y. The final tast is to calculate the difference between it with actual Y, which is called shock.

The numerical example is as follows:in order to get output 0.074 for 2012-11-01, I need to train a AR(2) model using all previous 10 months data from 2012-01-01 to 2012-10-01, that is from 0.15 to 1.85 for Y and 1 to 10 for N. Once I've trained this model, I use it to predict new value (0.2) in 2012-11-01. The final output "Shock" is the difference between actual obs in 2012-11-01 with the predicted one (0.15-0.2=0.05).

Similarly, in order to get output 0.08 for 2012-12-01, I need to train a AR(2) model using all previous 10 months data from 2012-02-01 to 2012-11-01, that is from 0.4 to 1.7 for Y and 2 to 11 for N. Once I've trained this model, I use it to predict new value (4.1) in 2012-12-01. The final output "Shock" is the difference between actual obs in 2012-12-01 with the predicted one (3.46-4.1=-0.65).

I do not know how to write such a function(AR_2) and call it in slide_index_dbl. Please remember there is a trend term a3*Tt in AR_2, I not sure how to model that.

The following is what I have tried. I use lm to implement a AR moel instead of arima. This is because I do not know how to use arima to predict new value. If anyone are familar with arima or Arima function, go use that.

Intercept_extract_lm<-function(x){
  N<-rep(1:10)
  model<-lm(x~ lag(x,1)+ lag(x, 2)+N)
  coef(model)["(Intercept)"]
}

Log_1_extract_lm<-function(x){
  N<-rep(1:10)
  model<-lm(x~ lag(x,1)+ lag(x, 2)+N)
  coef(model)["Lag_1"]
}

Log_2_extract_lm<-function(x){
  N<-rep(1:10)
  model<-lm(x~ lag(x,1)+ lag(x, 2)+N)
  coef(model)["Lag_2"]
}


drift_extract_lm<-function(x){
  N<-rep(1:10)
  model<-lm(x~ lag(x,1)+ lag(x, 2)+N)
  coef(model)["N"]
}

data1<-data%>%
  mutate(Lag_1=lag(Y,1),Lag_2=lag(Y,2),N=1:n(),
         a1=slide_index_dbl(Y, Date, Log_1_extract_lm, .before = months(10), .after = months(-1), .complete = T),
         a2=slide_index_dbl(Y, Log_2_extract_lm, .before = months(10), .after = months(-1), .complete = T),
         drift=slide_index_dbl(Y, Date, drift_extract_lm, .before = months(10), .after = months(-1), .complete = T),
         Intercept = slide_index_dbl(Y, Date, Intercept_extract_lm, .before = months(10), .after = months(-1), .complete = T),
         Predict=Intercept+a1*Lag_1+a2*Lag_2+drift,
         Shock=Y-Predict)

I understand that the biggest problem in my code is it can take only one input argument in slider, but I am using four in all the custom functions defined (Y and lag_1 and lag_2 and N). Unlike previous example, we just calculate rolling mean of one single variable, in this case, in order to run such a regression, we need four variables in each rolling window, but slider only has one variable input. Even we edit the "extract" function to reduce four variables to two ( lag_1 and lag_2 can be writted as lag(Y,1) and lag(Y,2), we still have Y and N two variables input)

As for the trend term, from my understanding, for example, for 2012-11-01, in order to run regression, I need previous 10 minths N, that is 1,2,3....10, that is serial number of obs. For 2012-12-01, it should be also include a N from 1,2,3...10, however, based on my code, it is 2,3,4...11. But adding constant of 1 (2,3...11 vs 1,2,...10) doesn't affect regression coefficient, right? Honestly, I am not sure about trend term in this case, please feel free to change it if you understand orginal authors.


Solution

  • You may do this. Your problem of lagged 2 [A(2) Model] will be catered by order = c(2,0,0) and you don't have to do it explicitly. arima() function is used in conjunction with predict().

    library(tidyverse)
    set.seed(123)
    Y <- cumsum(rnorm(48))
    date <- as.Date(c("2012-01-01", "2012-02-01", "2012-03-01", "2012-04-01", 
                      "2012-05-01","2012-06-01", "2012-07-01", "2012-08-01",
                      "2012-09-01","2012-10-01","2012-11-01", "2012-12-01",
                      "2013-01-01", "2013-02-01","2013-03-01", "2013-04-01", 
                      "2013-05-01","2013-06-01", "2013-07-01", "2013-08-01",
                      "2013-09-01","2013-10-01","2013-11-01", "2013-12-01",
                      "2014-01-01", "2014-02-01","2014-03-01", "2014-04-01", 
                      "2014-05-01","2014-06-01", "2014-07-01", "2014-08-01",
                      "2014-09-01","2014-10-01","2014-11-01", "2014-12-01",
                      "2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01", 
                      "2015-05-01","2015-06-01",  "2015-07-01", "2015-08-01",
                      "2015-09-01","2015-10-01","2015-11-01", "2015-12-01"))
    
    data <- data.frame(date,Y)
    
    library(lubridate, warn.conflicts = F)
    library(slider)
    
    data %>% 
      mutate(pred_roll_10 = slide_index_dbl(Y, .i = date, 
                                            ~ predict(arima(.x, c(2,0,0), method = 'ML'),1)$pred[1], 
                                            .before = months(10),
                                            .after = months(-1),
                                            .complete = T),
             shock = Y - pred_roll_10)
    #> Warning in log(s2): NaNs produced
    
    #> Warning in log(s2): NaNs produced
    
    #> Warning in log(s2): NaNs produced
    
    #> Warning in log(s2): NaNs produced
    #>          date          Y pred_roll_10        shock
    #> 1  2012-01-01 -0.5604756           NA           NA
    #> 2  2012-02-01 -0.7906531           NA           NA
    #> 3  2012-03-01  0.7680552           NA           NA
    #> 4  2012-04-01  0.8385636           NA           NA
    #> 5  2012-05-01  0.9678513           NA           NA
    #> 6  2012-06-01  2.6829163           NA           NA
    #> 7  2012-07-01  3.1438325           NA           NA
    #> 8  2012-08-01  1.8787713           NA           NA
    #> 9  2012-09-01  1.1919184           NA           NA
    #> 10 2012-10-01  0.7462564           NA           NA
    #> 11 2012-11-01  1.9703382    0.6951079  1.275230339
    #> 12 2012-12-01  2.3301521    2.0524671  0.277684986
    #> 13 2013-01-01  2.7309235    1.9333295  0.797594017
    #> 14 2013-02-01  2.8416062    2.0486834  0.792922863
    #> 15 2013-03-01  2.2857651    1.9959595  0.289805588
    #> 16 2013-04-01  4.0726782    1.7514948  2.321183420
    #> 17 2013-05-01  4.5705287    3.6469980  0.923530738
    #> 18 2013-06-01  2.6039116    4.0585260 -1.454614411
    #> 19 2013-07-01  3.3052675    1.6480334  1.657234048
    #> 20 2013-08-01  2.8324760    2.9543635 -0.121887466
    #> 21 2013-09-01  1.7646523    2.8739943 -1.109341911
    #> 22 2013-10-01  1.5466774    2.7845341 -1.237856702
    #> 23 2013-11-01  0.5206730    2.5894221 -2.068749114
    #> 24 2013-12-01 -0.2082182    1.3652172 -1.573435497
    #> 25 2014-01-01 -0.8332575    0.3654956 -1.198753080
    #> 26 2014-02-01 -2.5199508   -0.5625776 -1.957373251
    #> 27 2014-03-01 -1.6821638   -2.6388755  0.956711703
    #> 28 2014-04-01 -1.5287907   -0.7131115 -0.815679154
    #> 29 2014-05-01 -2.6669276   -1.2140023 -1.452925253
    #> 30 2014-06-01 -1.4131127   -2.6421098  1.228997135
    #> 31 2014-07-01 -0.9866485   -1.0140132  0.027364706
    #> 32 2014-08-01 -1.2817199   -0.8209904 -0.460729511
    #> 33 2014-09-01 -0.3865943   -1.2655968  0.879002504
    #> 34 2014-10-01  0.4915392   -1.1977841  1.689323273
    #> 35 2014-11-01  1.3131203   -0.4463373  1.759457622
    #> 36 2014-12-01  2.0017605    0.9931931  1.008567426
    #> 37 2015-01-01  2.5556782    1.7819117  0.773766509
    #> 38 2015-02-01  2.4937665    2.3699293  0.123837184
    #> 39 2015-03-01  2.1878038    2.1804620  0.007341787
    #> 40 2015-04-01  1.8073328    1.5452418  0.262091026
    #> 41 2015-05-01  1.1126258    1.2644697 -0.151843849
    #> 42 2015-06-01  0.9047086    0.2974574  0.607251157
    #> 43 2015-07-01 -0.3606878    0.7761950 -1.136882824
    #> 44 2015-08-01  1.8082682   -1.1719233  2.980191448
    #> 45 2015-09-01  3.0162302    1.9584189  1.057811310
    #> 46 2015-10-01  1.8931216    2.5356767 -0.642555104
    #> 47 2015-11-01  1.4902368    1.2115509  0.278685838
    #> 48 2015-12-01  1.0235814    1.4319100 -0.408328595
    

    Created on 2021-07-08 by the reprex package (v2.0.0)