
Forecasting irregular stock data with ARIMA and tsibble


I want to forecast a stock price using ARIMA, in the same way that R. Hyndman does in FPP3.

The first issue I've run into is that stock data is irregular, since the stock exchange is closed on weekends and some holidays. This causes problems when I want to use functions from the tidyverts packages:

> stock
# A tsibble: 750 x 6 [1D]
   Date        Open  High   Low Close Volume
   <date>     <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 2019-05-21  36.3  36.4  36.3  36.4    232
 2 2019-05-22  36.4  37.0  36.4  36.8   1007
 3 2019-05-23  36.7  36.8  36.1  36.1   4298
 4 2019-05-24  36.4  36.5  36.4  36.4    452
 5 2019-05-27  36.5  36.5  36.3  36.4   2032
 6 2019-05-28  36.5  36.8  36.4  36.5   3049
 7 2019-05-29  36.2  36.5  36.1  36.5   2962
 8 2019-05-30  36.8  37.1  36.8  37.1    432
 9 2019-05-31  36.8  37.4  36.8  37.4   8424
10 2019-06-03  37.3  37.5  37.2  37.3   1550
# ... with 740 more rows


> stock %>%
+ feasts::ACF(difference(Close)) %>%
+ autoplot()

Error in `check_gaps()`:
! .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.

The same error regarding gaps in time applies to other functions like fable::ARIMA() or feasts::gg_tsdisplay().

I have tried filling the gaps with values from previous rows:

stock %>%
  group_by_key() %>%
  fill_gaps() %>%
  tidyr::fill(Close, .direction = "down")

# A tsibble: 1,096 x 6 [1D]
   Date        Open  High   Low Close Volume
   <date>     <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 2019-05-21  36.3  36.4  36.3  36.4    232
 2 2019-05-22  36.4  37.0  36.4  36.8   1007
 3 2019-05-23  36.7  36.8  36.1  36.1   4298
 4 2019-05-24  36.4  36.5  36.4  36.4    452
 5 2019-05-25  NA    NA    NA    36.4     NA
 6 2019-05-26  NA    NA    NA    36.4     NA
 7 2019-05-27  36.5  36.5  36.3  36.4   2032
 8 2019-05-28  36.5  36.8  36.4  36.5   3049
 9 2019-05-29  36.2  36.5  36.1  36.5   2962
10 2019-05-30  36.8  37.1  36.8  37.1    432
# ... with 1,086 more rows

and everything works as it should from there. My questions are:

  • Is there a way to use the "tidyverts approach" without running into the issue regarding gaps in time?
  • If not, is filling the gaps with values from previous rows a correct way to overcome this or will it bias the model?

Solution

  • First, you're clearly using an old version of the feasts package, because the current version gives a warning rather than an error when computing the ACF from data with implicit gaps.

    Second, the answer depends on what analysis you want to do. You have three choices:

    1. use day as the time index and fill the gaps with NAs;
    2. use day as the time index and fill the gaps with the previous closing stock prices;
    3. use trading day as the time index, in which case there are no gaps.

    Here are the results for each of them, using an example of Apple stock over the period 2014-2018.

    library(fpp3)
    #> ── Attaching packages ─────────────────────────────────────── fpp3 0.4.0.9000 ──
    #> ✔ tibble      3.1.7     ✔ tsibble     1.1.1
    #> ✔ dplyr       1.0.9     ✔ tsibbledata 0.4.0
    #> ✔ tidyr       1.2.0     ✔ feasts      0.2.2
    #> ✔ lubridate   1.8.0     ✔ fable       0.3.1
    #> ✔ ggplot2     3.3.6     ✔ fabletools  0.3.2
    #> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
    #> ✖ lubridate::date()    masks base::date()
    #> ✖ dplyr::filter()      masks stats::filter()
    #> ✖ tsibble::intersect() masks base::intersect()
    #> ✖ tsibble::interval()  masks lubridate::interval()
    #> ✖ dplyr::lag()         masks stats::lag()
    #> ✖ tsibble::setdiff()   masks base::setdiff()
    #> ✖ tsibble::union()     masks base::union()
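
    Before picking one of the three options, it can help to see how many implicit gaps there are. The following is only a sketch, not part of the original reprex: it assumes that `update_tsibble(regular = TRUE)` is an acceptable way to declare a daily interval for `gafa_stock` (which is stored as an irregular tsibble), and the object names `aapl` and `gaps` are made up for illustration.

    ```r
    library(fpp3)

    # Declare a regular daily interval; gap detection needs a regular index.
    aapl <- gafa_stock %>%
      filter(Symbol == "AAPL") %>%
      update_tsibble(regular = TRUE)

    # One row per key, with a logical .gaps column
    has_gaps(aapl)

    # One row per run of missing days, with columns .from, .to and .n
    gaps <- count_gaps(aapl)
    gaps

    # Total number of calendar days that fill_gaps() would add
    sum(gaps$.n)
    ```

    The total in the last line should match the number of rows that `fill_gaps()` adds in option 1.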
    

    1. Fill non-trading days with missing values

    stock <- gafa_stock %>%
      filter(Symbol == "AAPL") %>%
      tsibble(index = Date, regular = TRUE) %>%
      fill_gaps()
    stock
    #> # A tsibble: 1,825 x 8 [1D]
    #>    Symbol Date        Open  High   Low Close Adj_Close    Volume
    #>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
    #>  1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
    #>  2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
    #>  3 <NA>   2014-01-04  NA    NA    NA    NA        NA          NA
    #>  4 <NA>   2014-01-05  NA    NA    NA    NA        NA          NA
    #>  5 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
    #>  6 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
    #>  7 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
    #>  8 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200
    #>  9 AAPL   2014-01-10  77.1  77.3  75.9  76.1      64.5  76244000
    #> 10 <NA>   2014-01-11  NA    NA    NA    NA        NA          NA
    #> # … with 1,815 more rows
    
    stock %>%
      model(ARIMA(Close ~ pdq(d=1)))
    #> A mable: 1 x 1
    #>  `ARIMA(Close ~ pdq(d = 1))`
    #>                      <model>
    #> 1              <ARIMA(0,1,0)>
    

    In this case, the ACF calculation uses only the longest contiguous stretch of non-missing observations. With every weekend missing, that stretch is never longer than one trading week, which is too short to be meaningful, so there isn't any point showing the results of ACF() or gg_tsdisplay(). Also, the automated choice of differencing in the ARIMA model fails due to the missing values, so I have manually set it to one. The other parts of the ARIMA model work fine in the presence of missing values.
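
    To see how short the longest contiguous part actually is, the run length of non-missing values can be checked directly. This is a rough sketch using base R's `rle()`; it rebuilds the data with `update_tsibble(regular = TRUE)` instead of the `tsibble()` call above (an assumption on my side that the two are interchangeable here), and `stock_na`, `runs` and `longest_run` are illustrative names.

    ```r
    library(fpp3)

    stock_na <- gafa_stock %>%
      filter(Symbol == "AAPL") %>%
      update_tsibble(regular = TRUE) %>%
      fill_gaps()

    # Run-length encode the "is observed" indicator, then keep only
    # the runs of observed (non-missing) closing prices.
    runs <- rle(!is.na(stock_na$Close))
    longest_run <- max(runs$lengths[runs$values])

    # Weekends are always missing, so this cannot exceed a Monday-Friday week.
    longest_run
    ```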

    2. Fill non-trading days with the last observed values

    stock <- stock %>%
      tidyr::fill(Close, .direction = "down")
    stock
    #> # A tsibble: 1,825 x 8 [1D]
    #>    Symbol Date        Open  High   Low Close Adj_Close    Volume
    #>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>
    #>  1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200
    #>  2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900
    #>  3 <NA>   2014-01-04  NA    NA    NA    77.3      NA          NA
    #>  4 <NA>   2014-01-05  NA    NA    NA    77.3      NA          NA
    #>  5 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700
    #>  6 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300
    #>  7 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400
    #>  8 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200
    #>  9 AAPL   2014-01-10  77.1  77.3  75.9  76.1      64.5  76244000
    #> 10 <NA>   2014-01-11  NA    NA    NA    76.1      NA          NA
    #> # … with 1,815 more rows
    
    stock %>%
      ACF(difference(Close)) %>%
      autoplot()
    

    stock %>%
      model(ARIMA(Close))
    #> # A mable: 1 x 1
    #>   `ARIMA(Close)`
    #>          <model>
    #> 1 <ARIMA(0,1,0)>
    
    stock %>%
      gg_tsdisplay(Close)
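
    On the question of whether this filling biases the model: one thing that can be checked directly is that every filled day has a day-to-day change of exactly zero, so the differenced series contains a block of artificial zeros every weekend. The sketch below only quantifies that; it does not by itself show how much the fitted model is affected. The names `filled`, `was_filled`, `change` and `summary_tbl` are hypothetical, and the data are rebuilt with `update_tsibble(regular = TRUE)` as an assumed equivalent of the code above.

    ```r
    library(fpp3)

    filled <- gafa_stock %>%
      filter(Symbol == "AAPL") %>%
      update_tsibble(regular = TRUE) %>%
      fill_gaps() %>%
      mutate(was_filled = is.na(Close)) %>%          # flag non-trading days first
      tidyr::fill(Close, .direction = "down") %>%    # carry last close forward
      mutate(change = difference(Close))             # first differences

    summary_tbl <- filled %>%
      as_tibble() %>%
      summarise(
        n_days = n(),
        n_filled = sum(was_filled),
        share_filled = mean(was_filled),
        # Filled days repeat the previous close, so their change is zero
        all_filled_changes_zero = all(change[was_filled] == 0)
      )
    summary_tbl
    ```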
    

    3. Re-index by trading day

    stock <- gafa_stock %>%
      filter(Symbol == "AAPL") %>%
      tsibble(index = Date, regular = TRUE) %>%
      mutate(trading_day = row_number()) %>%
      tsibble(index = trading_day)
    stock
    #> # A tsibble: 1,258 x 9 [1]
    #>    Symbol Date        Open  High   Low Close Adj_Close    Volume trading_day
    #>    <chr>  <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>     <dbl>       <int>
    #>  1 AAPL   2014-01-02  79.4  79.6  78.9  79.0      67.0  58671200           1
    #>  2 AAPL   2014-01-03  79.0  79.1  77.2  77.3      65.5  98116900           2
    #>  3 AAPL   2014-01-06  76.8  78.1  76.2  77.7      65.9 103152700           3
    #>  4 AAPL   2014-01-07  77.8  78.0  76.8  77.1      65.4  79302300           4
    #>  5 AAPL   2014-01-08  77.0  77.9  77.0  77.6      65.8  64632400           5
    #>  6 AAPL   2014-01-09  78.1  78.1  76.5  76.6      65.0  69787200           6
    #>  7 AAPL   2014-01-10  77.1  77.3  75.9  76.1      64.5  76244000           7
    #>  8 AAPL   2014-01-13  75.7  77.5  75.7  76.5      64.9  94623200           8
    #>  9 AAPL   2014-01-14  76.9  78.1  76.8  78.1      66.1  83140400           9
    #> 10 AAPL   2014-01-15  79.1  80.0  78.8  79.6      67.5  97909700          10
    #> # … with 1,248 more rows
    
    stock %>%
      ACF(difference(Close)) %>%
      autoplot()
    

    stock %>%
      model(ARIMA(Close))
    #> # A mable: 1 x 1
    #>   `ARIMA(Close)`
    #>          <model>
    #> 1 <ARIMA(2,1,3)>
    
    stock %>%
      gg_tsdisplay(Close)
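
    Since the goal is forecasting, here is a minimal sketch of producing forecasts with the trading-day index. It assumes the `update_tsibble(index = ..., regular = TRUE)` idiom for re-indexing, and the names `stock_td` and `fc` and the horizon of 10 are arbitrary choices for illustration. Note that the horizon is counted in trading days, so mapping the forecasts back to calendar dates needs a trading calendar, which is not shown here.

    ```r
    library(fpp3)

    # Re-index by trading day so that there are no gaps
    stock_td <- gafa_stock %>%
      filter(Symbol == "AAPL") %>%
      mutate(trading_day = row_number()) %>%
      update_tsibble(index = trading_day, regular = TRUE)

    # Fit an automatically selected ARIMA model and forecast
    # 10 trading days ahead (not 10 calendar days)
    fc <- stock_td %>%
      model(arima = ARIMA(Close)) %>%
      forecast(h = 10)
    fc

    # Plot the forecasts against the most recent part of the history
    fc %>%
      autoplot(stock_td %>% filter(trading_day > 1200))
    ```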
    

    Created on 2022-05-22 by the reprex package (v2.0.1)