Tags: python-3.x, pandas, indexing, statsmodels, arima

Python: ARIMA predictions returning all NaNs


I'm trying to follow the time series tutorial here (using my own dataset):

https://www.analyticsvidhya.com/blog/2018/02/time-series-forecasting-methods/

I was able to get as far as Part 7: ARIMA without trouble, but I am stuck on that section. All the values in my prediction column are NaN.

In the terminal, I see this warning: "a date index has been provided but it has no associated frequency information and so will be ignored when forecasting".

My test data set has a few date gaps where no transactions occurred, so I fill them with test=test.set_index('DATE').asfreq('D', fill_value=0). I do the same thing with my ARIMA dataset, so its index matches the test set.

The rest of the relevant code is as follows:

import statsmodels.api as sm

train=df[0:180]
test=df[180:]
SARIMA=test.copy()

fit=sm.tsa.statespace.SARIMAX(train['COUNT'], order=(1,1,1), seasonal_order=(0,0,0,5)).fit()
SARIMA['SARIMA']=fit.predict(start=0, 
    end=93,dynamic=True)

print(SARIMA) 
print(test)

In the print output, the index for the test set and the SARIMA set are the same. The SARIMA set has a column SARIMA that should hold the predictions, except they are all NaN. What am I missing?

test
DATE        COUNT
2018-06-21    1
2018-06-22    3
..
2018-11-21    3
2018-11-22    4

SARIMA
DATE        COUNT    SARIMA
2018-06-21    1       NaN
2018-06-22    3       NaN
..
2018-11-21    3       NaN
2018-11-22    4       NaN
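One possible explanation for the all-NaN column, which the post itself does not confirm: pandas aligns on index labels when a Series is assigned to a DataFrame column. If the predictions are indexed by training dates (as in-sample predictions with start=0 would be) and the target frame is indexed by test dates, no labels match and every row becomes NaN. A minimal sketch of that alignment behaviour, using made-up dates and a stand-in Series instead of a fitted model:

```python
import pandas as pd

# Hypothetical, non-overlapping date ranges standing in for train and test
train_idx = pd.date_range("2018-01-01", periods=5, freq="D")
test_idx = pd.date_range("2018-01-06", periods=3, freq="D")

# Stand-in for in-sample predictions: a Series indexed by the training dates
in_sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=train_idx)

test = pd.DataFrame({"COUNT": [1, 3, 4]}, index=test_idx)
out = test.copy()

# Assignment aligns on index labels; none of the dates overlap
out["SARIMA"] = in_sample

print(out)
```

If this is what is happening, requesting predictions over the test dates instead of positions 0 to 93 would make the labels line up.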

edit: for some reason statsmodels simply cannot detect the index frequency. I've tried:

SARIMA=SARIMA.set_index('DATE').asfreq('D',fill_value=0)
SARIMA.index=pd.to_datetime(SARIMA.index)
SARIMA.index=pd.DatetimeIndex(SARIMA.index.values, freq='D')

But the warning always appears.
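The frequency warning refers to the index of the series passed to the model, which in the code above is train, not test or SARIMA. So it may be worth checking whether the training data has a frequency attached. A minimal sketch with a made-up three-row frame (the column names match the post, the values are invented):

```python
import pandas as pd

# Hypothetical raw data with a two-day gap (no rows for 2018-06-23 and 2018-06-24)
raw = pd.DataFrame({
    "DATE": ["2018-06-21", "2018-06-22", "2018-06-25"],
    "COUNT": [1, 3, 2],
})

# Parse the dates before indexing; asfreq needs a datetime-like index
raw["DATE"] = pd.to_datetime(raw["DATE"])

# Reindex to daily frequency, filling the missing days with 0
daily = raw.set_index("DATE").asfreq("D", fill_value=0)

# The frequency is now attached to the index itself
print(daily.index.freq)
print(daily)
```

Applying the same step to the full frame before splitting it into train and test would give both pieces a daily index.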

edit2: I tried building a brand-new dataset in Excel:

DATE       COUNT
2018/01/01   1
2018/01/02   2
..
2018/01/10   3
2018/01/11   4

I created the model with the same lines as above, except with enforce_stationarity and enforce_invertibility set to False. All the predictions are still NaN.

edit3: using the fake Excel dataset, I've come one step closer. Passing start='2018-01-01' and end='2018-01-21' yielded predictions of all 0s, which is better than NaN. Can anyone make sense of these results?

edit4: setting dynamic=False returned reasonable predictions. Clearly I'm no statistician.


Solution

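  • Another possible cause of this behavior is the SARIMAX parameters themselves. I have not found a way to override this yet, so if it is the cause, try changing your initial parameters. The example below fits the same model with seasonal differencing D = 0 and D = 1 and plots the fitted values for each.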

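    import random

    import matplotlib.pyplot as plt
    import numpy as np
    import statsmodels.api as sm

    # 17 random observations, fewer than two full seasons of length 12
    endog = np.array(random.sample(range(100, 200), 17))

    # Fit once without (D=0) and once with (D=1) seasonal differencing
    for cd in range(2):
        m = sm.tsa.statespace.SARIMAX(
            endog=endog,
            order=(1, 1, 1),
            seasonal_order=(0, cd, 0, 12),
            trend='n',
        ).fit()

        # Compare the observed series with the in-sample fitted values
        plt.plot(endog)
        plt.plot(m.fittedvalues)
        plt.title('D: ' + str(cd))
        plt.show()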
    

    [Two plots of endog against the fitted values, one titled 'D: 0' and one titled 'D: 1']