Search code examples
rtsibble

R time series forecasting trend line is wrong because of the indexing


I'm trying to figure out the answer to a question about forecasting.

I'm using the tsibble package and following a textbook written by Hyndman and Athanasopoulos. It's a great textbook but also above my weight. https://otexts.com/fpp3/

Here is the first 100 rows of my data as a tsibble:

structure(list(X = 1:100, Time = structure(c(1546351200, 1546354800, 
1546358400, 1546362000, 1546365600, 1546369200, 1546372800, 1546376400, 
1546380000, 1546383600, 1546387200, 1546390800, 1546437600, 1546441200, 
1546444800, 1546448400, 1546452000, 1546455600, 1546459200, 1546462800, 
1546466400, 1546470000, 1546473600, 1546477200, 1546524000, 1546527600, 
1546531200, 1546534800, 1546538400, 1546542000, 1546545600, 1546549200, 
1546552800, 1546556400, 1546560000, 1546563600, 1546610400, 1546614000, 
1546617600, 1546621200, 1546624800, 1546628400, 1546632000, 1546635600, 
1546639200, 1546642800, 1546646400, 1546650000, 1546696800, 1546700400, 
1546704000, 1546707600, 1546711200, 1546714800, 1546718400, 1546722000, 
1546725600, 1546729200, 1546732800, 1546736400, 1546783200, 1546786800, 
1546790400, 1546794000, 1546797600, 1546801200, 1546804800, 1546808400, 
1546812000, 1546815600, 1546819200, 1546822800, 1546869600, 1546873200, 
1546876800, 1546880400, 1546884000, 1546887600, 1546891200, 1546894800, 
1546898400, 1546902000, 1546905600, 1546909200, 1546956000, 1546959600, 
1546963200, 1546966800, 1546970400, 1546974000, 1546977600, 1546981200, 
1546984800, 1546988400, 1546992000, 1546995600, 1547042400, 1547046000, 
1547049600, 1547053200), tzone = "", class = c("POSIXct", "POSIXt"
)), Orders = c(390.9300738, 424.5024938, 459.9507418, 493.879574, 
521.1915476, 543.6420076, 564.1188556, 583.1138192, 599.9870792, 
608.2502946, 589.0774506, 552.9864864, 460.0146478, 513.6096, 
565.54751, 614.3622836, 649.610842, 673.080916, 694.1457822, 
714.687121, 730.0065136, 727.6420116, 704.9715348, 669.8276592, 
596.9598262, 627.6943506, 663.224885, 689.6623702, 705.7821348, 
705.2804398, 702.4425002, 686.257045, 673.0440842, 631.4105592, 
590.2226836, 557.170647, 505.5489378, 514.1362306, 518.0591858, 
519.7312244, 515.538957, 517.0255898, 516.4563428, 519.1586616, 
518.8452174, 494.1823666, 468.0562396, 444.6603772, 465.6368096, 
475.6484144, 481.8889642, 489.3110196, 492.9861102, 495.1878822, 
496.2013992, 500.4736856, 502.9525222, 490.2884824, 465.9459928, 
446.4332428, 468.827488, 475.2297188, 480.3550016, 486.8966308, 
488.641556, 492.2285006, 493.485411, 501.271116, 501.7387056, 
485.8849556, 462.3912654, 444.3381798, 423.7514376, 442.9296904, 
456.1299334, 459.6065968, 466.132121, 468.2358706, 476.4634124, 
481.580409, 484.5936224, 477.3972256, 458.546062, 439.0321916, 
397.7730592, 418.1761574, 426.6949568, 435.0296632, 438.8624322, 
437.5872586, 441.9915442, 445.0284556, 443.4291354, 445.9624284, 
430.1143198, 420.7732792, 485.8293664, 494.2056144, 502.3287016, 
509.0143842)), class = c("tbl_ts", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -100L), key = structure(list(.rows = structure(list(
    1:100), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-1L)), index = structure("Time", ordered = TRUE), index2 = "Time", interval = structure(list(
    year = 0, quarter = 0, month = 0, week = 0, day = 0, hour = 1, 
    minute = 0, second = 0, millisecond = 0, microsecond = 0, 
    nanosecond = 0, unit = 0), .regular = TRUE, class = c("interval", 
"vctrs_rcrd", "vctrs_vctr")))

I create a graph.

enter image description here

My next step is to understand the trend, seasonality and cycle in the data so I create a decomposition.

dcmp <- sales %>%
  model(stl = STL(Orders))

The code above compiles but the code below doesn't work.

dcmp <- components(dcmp)

Error in `transmute()`:
! Problem while computing `cmp = map(.fit, components)`.
Caused by error in `UseMethod()`:
! no applicable method for 'components' applied to an object of class "null_mdl"
Backtrace:
  1. generics::components(dcmp)
  8. fabletools:::map(.fit, components)
  9. base::lapply(.x, .f, ...)
 11. fabletools:::components.mdl_ts(X[[i]], ...)
 12. generics::components(object$fit, ...)

I google the error message and discover eventually that I'm supposed to fill in the missing gaps.

So I fill in the missing gaps and change the NA values that appear to 0.

sales <- sales %>%
  fill_gaps()

sales$Orders[is.na(sales$Orders)] <- 0

I'm then able to do my decomposition and graph it.

dcmp <- sales %>%
  model(stl = STL(Orders))

dcmp <- components(dcmp)

dcmp %>%
  as_tsibble() %>%
  autoplot(Orders, colour="gray") +
  geom_line(aes(y=trend), colour = "#D55E00") +
  labs(
    y = "Orders",
    title = "Orders") + 
  labs(caption = "fake data")

enter image description here But now I'm totally stuck because this isn't what I wanted. I want to find the trend in the sales but this orange trend line really undershoots it because it's averaging with all the zeros.

How does someone who is skilled in time series forecasting take data that has hours from 9am-8pm and zero values in the interval, and turn that into a time series dataframe where you can work with these time series decomposition components. I don't just want to approximate what STL is doing with a moving average because I want to the STL picture.

Here is a screenshot from the textbook that is closer to what I was expecting as the output. You see the line is closer to the real data and not pulled down with zeros.

enter image description here

I'm hoping for answers using this set of packages for time series forecasting in R because there are things in this textbook I want to use also.


Solution

  • a tsibble can't handle missing data very well for forecasting if the missing data is structural, like no weekends for stock data or data based on business hours. It works with prophet, but not for SNAIVE or other functions.

    What you can do is create an index (or use a variable as an index). In the example below I use your variable X as the index. Normally I would create one via mutate. Keep any groupings you need in mind when creating an index.

    Forecasting is done based on this index. So forecast(model, h = 12) will forecast 12 index values into the future. You then need to translate that back into your Time column.

    See code below to get you started.

    library(fpp3)
    
    sales <- df1
    
    # If not there use mutate to create an index, now I just use X as the index
    fc <- sales %>% 
      # mutate(idx = row_number()) %>% 
      tsibble(index = X)
    
    
    dcmp <- fc %>%
      model(stl = STL(Orders))
    
    dcmp %>%
      components() %>% 
      as_tsibble() %>%
      autoplot(Orders, colour="gray") +
      geom_line(aes(y=trend), colour = "#D55E00") +
      labs(
        y = "Orders",
        title = "Orders") + 
      labs(caption = "fake data")