Tags: python, python-3.x, time-series, prediction, ets

Failure to predict with ETS


Good morning, everyone. I am trying to make a forecast using ETS.

I have the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sktime.forecasting.ets import AutoETS


datos = [21.5294, 21.5228, 21.5289, 21.5096, 21.506, 21.5119, 21.5173, 21.5308, 21.5355, 21.5181, 21.5, 21.4972, 21.5067, 21.5149, 21.4994, 21.4967, 21.4774, 21.4662, 21.4752, 21.4858, 21.4581, 21.4398, 21.4385, 21.4471, 21.4399, 21.444, 21.4555, 21.4366, 21.4402, 21.4371, 21.4317, 21.4342, 21.411, 21.4174, 21.4149, 21.4151, 21.4186, 21.4411, 21.4569, 21.4628, 21.448, 21.4468, 21.4357, 21.4329, 21.4543, 21.4429, 21.4478, 21.4423, 21.4536, 21.4416, 21.4384, 21.4378, 21.4622, 21.4413, 21.4315, 21.4419, 21.4323, 21.429, 21.4103, 21.4194, 21.4364, 21.4245, 21.4348, 21.4276, 21.4113, 21.4235, 21.407, 21.412, 21.4263, 21.431, 21.4362, 21.432, 21.4445, 21.4487, 21.4623, 21.4766, 21.4785, 21.4891, 21.4869, 21.4903, 21.4839, 21.4856, 21.4909, 21.5048, 21.5005, 21.4905, 21.4906, 21.4914, 21.5052, 21.4898, 21.5232, 21.5234, 21.5086, 21.5108, 21.5017, 21.5141, 21.5055, 21.4953, 21.4618, 21.4504, 21.4667, 21.4602, 21.453, 21.4497, 21.4446, 21.4308, 21.4347, 21.4512, 21.4675, 21.4675, 21.465, 21.4624, 21.4682, 21.472, 21.4632, 21.4644, 21.4615, 21.4604, 21.4679, 21.4672]
indice = pd.date_range("2020-10-31 23:57:00", periods=len(datos), freq="T")

datos = pd.Series(data=datos, index=indice)

datos = datos.asfreq(freq='T')


pasado = datos[:100]
futuro = datos[100:]


model_auto = AutoETS(auto=True, initialization_method='heuristic', allow_multiplicative_trend=True, n_jobs=-1, sp=10)
model_auto.fit(pasado)


# Forecasting horizon: steps 1 to 20 after the end of the training data.
lista = list(np.arange(1, 21))
prediccion = model_auto.predict(lista)

#print(pasado)
#print(futuro)
#print(prediccion)

pasado.plot()
futuro.plot()
prediccion.plot()
plt.show()

The result is as follows:

[Image: Predict]

The blue line corresponds to the data with which I train the model.

The orange line corresponds to the 'future' data.

The green line corresponds to the prediction and should be close to the orange line.

I don't know why the prediction is always the same value.

I would like to know your opinion. Do you know why this happens in this prediction and how I can correct it?

Thank you.


Solution

  • It is not an error as such. I am not an expert on the subject, but the short answer is: "It is due to the data set you have".

    The long answer is easier to follow with an example. Imagine for a moment that you have a different data set, for example:

    datos = [
        30.05251300, 19.14849600, 25.31769200, 27.59143700,
        32.07645600, 23.48796100, 28.47594000, 35.12375300,
        36.83848500, 25.00701700, 30.72223000, 28.69375900,
        36.64098600, 23.82460900, 29.31168300, 31.77030900,
        35.17787700, 19.77524400, 29.60175000, 34.53884200,
        41.27359900, 26.65586200, 28.27985900, 35.19115300,
        42.20566386, 24.64917133, 32.66733514, 37.25735401,
        45.24246027, 29.35048127, 36.34420728, 41.78208136,
        49.27659843, 31.27540139, 37.85062549, 38.83704413,
        51.23690034, 31.83855162, 41.32342126, 42.79900337,
        55.70835836, 33.40714492, 42.31663797, 45.15712257,
        59.57607996, 34.83733016, 44.84168072, 46.97124960,
        60.01903094, 38.37117851, 46.97586413, 50.73379646,
        61.64687319, 39.29956937, 52.67120908, 54.33231689,
        66.83435838, 40.87118847, 51.82853579, 57.49190993,
        65.25146985, 43.06120822, 54.76075713, 59.83447494,
        73.25702747, 47.69662373, 61.09776802, 66.05576122]
    
    indice = pd.date_range("2020-10-31 23:57:00", periods=len(datos), freq="T")
    
    datos = pd.Series(data=datos, index=indice)
            
    datos = datos.asfreq(freq='T')
    

    With that data, the code would look similar to this:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.tsa.exponential_smoothing.ets import ETSModel
        
    datos = [
            30.05251300, 19.14849600, 25.31769200, 27.59143700,
            32.07645600, 23.48796100, 28.47594000, 35.12375300,
            36.83848500, 25.00701700, 30.72223000, 28.69375900,
            36.64098600, 23.82460900, 29.31168300, 31.77030900,
            35.17787700, 19.77524400, 29.60175000, 34.53884200,
            41.27359900, 26.65586200, 28.27985900, 35.19115300,
            42.20566386, 24.64917133, 32.66733514, 37.25735401,
            45.24246027, 29.35048127, 36.34420728, 41.78208136,
            49.27659843, 31.27540139, 37.85062549, 38.83704413,
            51.23690034, 31.83855162, 41.32342126, 42.79900337,
            55.70835836, 33.40714492, 42.31663797, 45.15712257,
            59.57607996, 34.83733016, 44.84168072, 46.97124960,
            60.01903094, 38.37117851, 46.97586413, 50.73379646,
            61.64687319, 39.29956937, 52.67120908, 54.33231689,
            66.83435838, 40.87118847, 51.82853579, 57.49190993,
            65.25146985, 43.06120822, 54.76075713, 59.83447494,
            73.25702747, 47.69662373, 61.09776802, 66.05576122]
        
    indice = pd.date_range("2020-10-31 23:57:00", periods=len(datos), freq="T")
        
    datos = pd.Series(data=datos, index=indice)
        
    datos = datos.asfreq(freq='T')
              
              
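    # Split kept for reference only; the model below is fitted on the full series.
    pasado = datos[:48]
    futuro = datos[47:]
    
                  
    # Additive error, damped additive trend, additive seasonality with period 4.
    modelo = ETSModel(datos, error="add", trend="add", seasonal="add",
                      damped_trend=True, seasonal_periods=4)
    # If the optimizer does not converge, try modelo.fit(maxiter=10000).
    fit = modelo.fit()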
        
    print(fit.summary())
        
    # Predictions for the last 21 timestamps of the series (00:44 to 01:04).
    pred = fit.get_prediction(start='2020-11-01 00:44:00', end='2020-11-01 01:04:00')
        
    df = pred.summary_frame(alpha=0.05)
        
        
    # Simulate 100 possible paths, each 10 steps past the end of the data.
    simulated = fit.simulate(anchor="end", nsimulations=10, repetitions=100)
    
    for i in range(simulated.shape[1]):
        simulated.iloc[:, i].plot(label='_', color='gray', alpha=0.1)
          
    df["mean"].plot(label='mean prediction')
    df["pi_lower"].plot(linestyle='--', color='tab:cyan', label='95% interval')
    df["pi_upper"].plot(linestyle='--', color='tab:cyan', label='_')
    
    pred.endog.plot(label='data')
    plt.legend()
    plt.show()
    

    You would get a result like this:

    [Image: Simulate OK]

    The data is drawn in orange. The model's mean prediction is drawn in blue, and the 95% prediction interval around that mean is drawn as dashed cyan lines. Past the end of the data, the model simulates 10 steps ahead and repeats the simulation 100 times; those are the gray lines.

    In this particular case the model fits the data very well, but of course it does: it is a textbook example. Real-world data rarely behaves this well.

    Although you are using a different library, the example still helps explain the result you get.

    Once fitted, the ETS model offers several methods for prediction:

    • forecast: out-of-sample forecasts
    • predict: in-sample and out-of-sample predictions
    • simulate: run simulations of the state space model
    • get_prediction: in-sample and out-of-sample predictions, together with prediction intervals

    In your case, the data looks stochastic (for lack of a better word) to the model, and this particular model has a hard time deciding where the data will go next. So it estimates a mean, together with upper and lower bounds within which future values may fall.

    Taking the same code and changing only the data to yours, you would have something like this:
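
    To see why the point forecast can come out as a flat line, here is a minimal sketch of simple exponential smoothing, which is what ETS reduces to when it has no trend and no seasonal component. It is an assumption that the automatic selection in the question ends up with a model of this kind (the flat line suggests so); the function name, the data values and the smoothing factor are made up for illustration.

```python
import numpy as np

def ses_forecast(y, alpha, steps):
    """Simple exponential smoothing: one smoothed level, no trend, no seasonality."""
    level = y[0]  # simple initialization: start from the first observation
    for value in y[1:]:
        # Move the level part of the way towards each new observation.
        level = alpha * value + (1 - alpha) * level
    # There is no trend or seasonal term to extrapolate,
    # so every future step repeats the last level.
    return np.full(steps, level)

y = np.array([21.47, 21.46, 21.45, 21.46, 21.47, 21.46])  # made-up values
pred = ses_forecast(y, alpha=0.5, steps=5)
print(pred)
```

    Whatever the horizon, all forecast values are identical. The uncertainty shows up only in the prediction interval, not in the point forecast.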

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.tsa.exponential_smoothing.ets import ETSModel
    
        
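    # datos is the 120-value series from the question, built as a pd.Series
    # with a one-minute DatetimeIndex, exactly as in the original code.
    # The split is only used for plotting; the model is fitted on the full series.
    pasado = datos[:100]
    futuro = datos[99:]
    print(futuro)
            
    modelo = ETSModel(datos, error="add", trend="add", seasonal="add",
                      damped_trend=True, seasonal_periods=4)
    # If the optimizer does not converge, try modelo.fit(maxiter=10000).
    fit = modelo.fit()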
    
    print(fit.summary())
    
    # Predictions for the last 21 timestamps of the series (01:36 to 01:56).
    pred = fit.get_prediction(start='2020-11-01 01:36:00', end='2020-11-01 01:56:00')
    
    df = pred.summary_frame(alpha=0.05)
    
    
    
    
    # Simulate 100 possible paths, each 20 steps past the end of the data.
    simulated = fit.simulate(anchor="end", nsimulations=20, repetitions=100)
    for i in range(simulated.shape[1]):
        simulated.iloc[:, i].plot(label='_', color='gray', alpha=0.1)
    
    
    df["mean"].plot(label='mean prediction')
    df["pi_lower"].plot(linestyle='--', color='tab:cyan', label='95% interval')
    df["pi_upper"].plot(linestyle='--', color='tab:cyan', label='_')
    pred.endog.plot(label='data')
    
    pasado.plot(label='Pasado')
    futuro.plot(label='Futuro')
    
    
    
    plt.legend()
    plt.show()
    

    [Image: Simulation]

    After the training data (in green) there is a kind of bubble, the region between the dashed cyan lines, which is the model's estimate of where the data could be in the future. The line that always shows the same value in your plot is the model's estimated mean of those future values. In other words, with this data the model cannot fit the values in the futuro variable precisely.

    [Image: Sim 1]

    [Image: Forecast]

    A model that might fit this data better is SARIMA or SARIMAX. For these it is best to find some mechanism or library that selects order = (p, d, q) and seasonal_order = (P, D, Q, s) automatically, although the computational cost may start to rise.

    Of course, there are many more models. Mathematica has a function (whose name I can't recall right now) that searches for the model and parameter set that best fit the data. Python may have something similar somewhere; if so, I'd love to hear about it.

    [Image: SARIMA]