Search code examples
rstatisticsforecastingfacebook-prophet

Box Cox Transformation with zero Data Daily


I have a daily data of sales with zero values (by holidays and sundays) and I want to apply boxCox.lambda() function, but clearly with the zero values this is impossible. Mi options actually are:

1 - Change the zero values ​​by values ​​approaching zero, but I do not know how this can affect my forecast.

Any suggestions I will be grateful.

This is my data:

Data
0
2621
3407
3644
3569
1212
0
0
4473
3885
3671
3641
1453
0
4182
3812
3650
3444
3557
1612
0
4004
3631
3342
3203
3424
1597
0
4280
3644
3642
3696
3793
1753
0
4416
3935
3522
3544
3569
1649
0
3871
3442
3144
3158
3693
1780
0
4322
3682
3499
3279
3485
1716
0
4255
3713
3470
3673
3983
1931
0
4771
3986
3833
3501
3620
1710
0
4407
3799
3654
3332
3693
1780
0
0
4574
4016
3748
3559
1625
0
4548
3726
2780
0
0
122
0
5005
4300
3772
3929
3917
2021
0
4820
4117
3668
3664
3639
1742
0
4473
4151
3844
3499
3736
1838
0
4346
3693
3297
3327
3639
1773
0
4519
0
4352
4079
4143
1970
0
4693
4018
3679
3838
3606
1601
0
0
4289
4011
3742
3710
1781
0
4186
3707
3600
3484
3702
1747
0
4195
3838
3504
3609
3934
1943
0
0
5243
4754
4164
4121
1854
0
0
5173
4518
3875
3889
1904
0
5105
4056
4186
4079
3953
1846
0
4543
4341
4013
2998
4048
1767
0
0
4317
5260
5185
4969
2046
0
5683
5004
4567
4542
4266
2065
0
4357
5281
4830
4510
0
1567
0
5818
4906
4518
4218
4275
2074
0
5005
4645
4543
4558
4574
2129
0
4755
0
4458
3845
3746
1689
0
4285
3476
3447
2959
3470
1584
0
0
4159
3881
3533
3360
1643
0
4152
3748
3329
3112
3303
1790
0
3852
4190
3482
3313
3400
1582
0
4042
3706
3451
3137
3178
1518
0
4077
3754
3429
3369
3307
1467
0
3918
3620
3442
3302
3168
1630
0
3967
3707
3397
3294
3314
1646
0
4196
3812
3478
3111
3113
1411
0
0
3717
3501
3282
3366
1554
0
3737
3428
3028
2960
2977
1513
0
3608
3306
2941
2918
3238
1543
0
0
3959
3678
3367
3237
1024
0
0
4057
3562
3344
3367
1602
0
3784
3581
3395
2948
3009
1446
0
3676
3276
3112
3125
3133
1502
0
4200
4027
3739
3531
3222
2
0
4446
4342
4066
3811
2932
1643
0
4587
4534
4146
3994
3350
1400
0
1248
0
4248
4629
4346
1844
0
168

Solution

  • I'd recommend you just drop all the Sundays from your data. As we know they will alway s be zero there is no point in spending time and effort on forecasting them.

    The periodicity is very strong even with them removed, and diagnosing the data by looking at acf plots etc. is much more straight forward.

    # Removing every Sunday and creating a ts object of appropriate frequency
    x6 <- x[seq_along(x) %% 7 != 0]
    x6.ts <- ts(x6, frequency=6)
    
    # Plenty of periodic structure left
    par(mfcol=c(2, 1))
    sp <- split(x6.ts, (seq_along(x6.ts)-1) %% 6 + 1)
    stripchart(sp, vertical=TRUE, col=rainbow(6, alpha=0.2, start=0.97), pch=16,
      method="jitter", group.names=c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
    plot.default(x6.ts, type="p", pch=16, col=rainbow(6, alpha=0.6, start=0.97))
    

    enter image description here

    The we could f.ex apply a SARIMA model

    acf(x6.ts, adj=c(0.5))
    title("x6.ts", cex.main=0.9)
    
    acf(diff(x6.ts, lag=6))
    title("diff(x6.ts, lag=6)", cex.main=0.9)
    

    enter image description here

    I see a seasonal random walk there, and once we take the seasonal difference we see that there's at least a couple of seasonal autoregressive components, and maybe a non-seasonal autoregression.

    aa6.1 <- arima(x6.ts, order=c(0, 0, 0), seasonal=c(1, 1, 0))
    aa6.2 <- arima(x6.ts, order=c(0, 0, 0), seasonal=c(2, 1, 0))
    aa6.3 <- arima(x6.ts, order=c(1, 0, 0), seasonal=c(2, 1, 0))
    aa6.4 <- arima(x6.ts, order=c(1, 0, 0), seasonal=c(3, 1, 0))
    
    dummy11 <- model.matrix(~ as.factor(seq_along(x6.ts) %% 11))[,2]
    aa6.5 <- arima(x6.ts, order=c(1, 0, 0), seasonal=c(3, 1, 0),
      xreg=dummy11)
    
    
    AIC(aa6.1, aa6.2, aa6.3, aa6.4, aa6.5)
    #       df      AIC
    # aa6.1  2 5244.846
    # aa6.2  3 5195.019
    # aa6.3  4 5192.212
    # aa6.4  5 5179.310
    # aa6.5  6 5164.567
    
    acfr <- function(x){
        a <- acf(residuals(x), plot=FALSE)
        a$acf[1, 1, 1] <- 0
        plot(a, main="", frame.plot=FALSE, ylim=c(-0.2, 0.2))
        mod <- paste(paste(names(x$call), 
          as.character(x$call), sep="=")[-1], collapse=", ")
        text(-0.1, 0.19, pos=4, xpd=NA,
          paste0("AIC: ", round(x$aic), "\n", "Mod: ", mod))
    }
    
    par(mfcol=c(5, 1))
    k <- lapply(list(aa6.1, aa6.2, aa6.3, aa6.4, aa6.5), acfr)
    

    enter image description here

    Seems like (1 0 0) (3 1 0)[6] does a decent job, but there's a persistent autocorrelation at lag 11. This is an artefact of the removal of Sundays, but we can address it by including an external regressor of dummys.