python pandas dataframe matplotlib statsmodels

Predicting future data using a single column as input data


I am having trouble predicting future values from my input data. I am fairly new to statsmodels, so I am not sure whether this is even possible with this amount of input data.

This is the DataFrame that I am using. (Note: Starts at index 5 since I had to filter some data)

    year  suicides_no
5   1990       193361
6   1991       198020
7   1992       211473
8   1993       221565
9   1994       232063
10  1995       243544
11  1996       246725
12  1997       240745
13  1998       249591
14  1999       256119
15  2000       255832
16  2001       250652
17  2002       256095
18  2003       256079
19  2004       240861
20  2005       234375
21  2006       233361
22  2007       233408
23  2008       235447
24  2009       243487
25  2010       238702
26  2011       236484
27  2012       230160
28  2013       223199
29  2014       222984
30  2015       203640

From this, I'd like to get a prediction for the years 2016-2022 and plot it on a graph like this one.


Solution

  • This is a rather open-ended problem. I can certainly show you how you might write some code to make a prediction, but I think discussing how to make a good prediction is beyond the scope of Stack Overflow. That depends heavily on a good understanding of the problem domain.

    But with that caveat aside, on with the show. You've suggested you'd like to see a statsmodels example.

    Statsmodels is certainly capable of these sorts of forecasts. There are lots of approaches but yes, you can take a 1D time-series and use it to make future predictions.

    There's also a detailed tutorial of state space models here - this is a common approach, or rather, family of approaches. Different state-space models would be used depending on e.g. whether you feel seasonality (cyclic behaviour), or certain exogenous variables (contextual drivers of behaviour) are important or not.

    I adapted a simple example from there:

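    import pandas as pd
    import statsmodels.api as sm
    
    # df = your DataFrame (columns: year, suicides_no)
    
    # Take the yearly counts as the series to model and index them by year
    endog = df["suicides_no"].astype(float)
    endog.index = pd.period_range("1990", "2015", freq="Y")
    
    # Construct the (very simple) AR(1) model with a constant term
    mod = sm.tsa.SARIMAX(endog, order=(1, 0, 0), trend='c')
    
    # Estimate the parameters
    res = mod.fit()
    
    # Forecast the next 7 years (2016-2022)
    res.forecast(steps=7)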
    

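    The order parameter, (p, d, q), determines exactly what sort of model you get: p autoregressive lags, d rounds of differencing and q moving-average lags. order=(1, 0, 0) is about as simple as it gets: a first-order autoregressive model, AR(1), which treats each year's value as a constant plus a multiple of the previous year's value plus noise (y_t = c + phi * y_{t-1} + e_t), and extrapolates that relationship forward.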

    As I said, I cannot guarantee it will give you a good forecast here (it's definitely a reach to use 26 annual observations to predict the next 7), but you could test different parameters and read up on this type of model.