Tags: python, deep-learning, time-series, lstm, forecasting

LSTM forecasting with single categorical feature


I'm pretty new to time series.
This is the dataset I'm working on:

           Date   Price               Location
0    2012-01-01  1771.0                 Marche
1    2012-01-01  1039.0               Calabria
2    2012-01-01  2193.0               Campania
3    2012-01-01  2015.0         Emilia-Romagna
4    2012-01-01  1483.0  Friuli-Venezia Giulia
...         ...     ...                    ...
2475 2022-04-01  1963.0                  Lazio
2476 2022-04-01  1362.0  Friuli-Venezia Giulia
2477 2022-04-01  1674.0         Emilia-Romagna
2478 2022-04-01  1388.0                 Marche
2479 2022-04-01  1103.0                Abruzzo

I'm trying to build an LSTM for price prediction, but I don't know how to handle the categorical Location feature: should I use one-hot encoding or a groupby? What I want to predict is the price for each location.
How can I achieve that? A Python solution would be particularly appreciated.

Thanks in advance.


Solution

  • Suppose my dataset (df) is analogous to yours:

              Date       Price  Location
    0   2021-01-01  791.076890  Campania
    1   2021-01-01  705.702464  Lombardia
    2   2021-01-01  719.991382  Sicilia
    3   2021-02-01  825.760917  Lombardia
    4   2021-02-01  747.734309  Sicilia
    ...        ...         ...        ...
    31  2021-11-01  886.874348  Lombardia
    32  2021-11-01  935.040583  Campania
    33  2021-12-01  771.165378  Sicilia
    34  2021-12-01  952.255227  Campania
    35  2021-12-01  939.754515  Lombardia
    

    In my case I have one Price record per month for each of 3 regions (Campania, Lombardia, Sicilia). My idea is to treat the different regions as different features, so I would transform df as follows:

    df = df.set_index(["Date", "Location"]).Price.unstack()
    

    Now my dataset looks like this:

    Location    Campania    Lombardia   Sicilia
    Date            
    2021-01-01  791.076890  705.702464  719.991382
    2021-02-01  758.872755  825.760917  747.734309
    2021-03-01  880.038005  803.165998  837.738419
           ...         ...         ...         ...
    2021-10-01  908.402345  805.081193  792.369610
    2021-11-01  935.040583  886.874348  736.862025
    2021-12-01  952.255227  939.754515  771.165378
    

    After this step, make sure there are no NaN values (check with df.isna().sum()): a region that has no price for a given month will show up as NaN in its column.
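
A minimal sketch of that check, on a small made-up table where one (Date, Location) pair is missing. The fill strategy (time-based interpolation, then forward/backward fill for the edges) is only an assumption about what is acceptable for short gaps; it is not something the original answer prescribes, and the names `long_df`, `wide` and `filled` are illustrative.

```python
import pandas as pd

# Hypothetical long-format data: Sicilia has no record for 2021-02-01.
long_df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-03-01"]
    ),
    "Location": ["Campania", "Sicilia", "Campania", "Campania", "Sicilia"],
    "Price": [791.0, 720.0, 759.0, 880.0, 838.0],
})

# Same reshaping as above: one column per region, one row per date.
wide = long_df.set_index(["Date", "Location"]).Price.unstack()

# Count missing values per region; any non-zero entry needs attention.
missing_before = wide.isna().sum()
print(missing_before)

# Assumption: gaps are short, so interpolating along the date index is reasonable.
# ffill/bfill only matter if a gap sits at the very start or end of a column.
filled = wide.interpolate(method="time").ffill().bfill()
missing_after = int(filled.isna().sum().sum())
print(missing_after)
```

If whole stretches of a region are missing, dropping that region may be safer than interpolating it.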

    Now you can pass this data to a multi-feature RNN (or LSTM), as done in this example, or to a multi-channel 1D-CNN (with an appropriate kernel size). The main problem in both cases could be the small size of the dataset, so try not to over-parameterize the model (for example, reduce the number of neurons and layers); otherwise overfitting will be hard to avoid. To check for it, you can test the model on the last 20% of your time series (shuffle=False keeps the rows in chronological order):

    from sklearn.model_selection import train_test_split
    df_train, df_test = train_test_split(df, shuffle=False, test_size=.2)
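
On the over-parameterization point above, it can help to compare the number of trainable weights with the number of observed prices. The sketch below counts parameters for one LSTM layer followed by a dense output layer, using the usual formula 4 * (units * (features + units) + units) for an LSTM with a single bias per gate (the Keras convention; other libraries may count biases differently). The helper names and layer sizes are illustrative. The 20 regions and 124 months are what the question's table implies (2480 rows from 2012-01 to 2022-04), assuming every region has a row for every month.

```python
def lstm_param_count(n_features: int, units: int) -> int:
    # Four gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (units * (n_features + units) + units)


def dense_param_count(n_inputs: int, n_outputs: int) -> int:
    # Weight matrix plus one bias per output.
    return n_inputs * n_outputs + n_outputs


n_regions = 20   # assumption: 2480 rows / 124 months
n_months = 124   # 2012-01 through 2022-04
n_observations = n_regions * n_months

for units in (8, 64):
    total = lstm_param_count(n_regions, units) + dense_param_count(units, n_regions)
    print(f"units={units}: {total} parameters vs {n_observations} observed prices")
```

With 64 units the model has roughly nine times more weights than there are data points, while 8 units keeps the count below the number of observations.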
    

    The last step is to build matching (X, Y) pairs for supervised learning, but this depends on which model you are using and what your prediction task is. Another example here.
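
As one possible way to build the (X, Y) pairs, here is a sketch of a sliding window for one-step-ahead forecasting: each input is the last `n_lags` months for all regions, and the target is the next month's prices for all regions. The function name `make_windows`, the window length and the toy array are assumptions for illustration; a different prediction task would need a different pairing.

```python
import numpy as np


def make_windows(values, n_lags, horizon=1):
    """Turn a (time, features) array into supervised (X, Y) pairs.

    X[i] holds n_lags consecutive rows; Y[i] is the row `horizon`
    steps after the end of that window.
    """
    values = np.asarray(values, dtype=float)
    n_samples = len(values) - n_lags - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested window")
    X = np.stack([values[i:i + n_lags] for i in range(n_samples)])
    Y = np.stack([values[i + n_lags + horizon - 1] for i in range(n_samples)])
    return X, Y


# Toy stand-in for the wide table: 12 months x 3 regions.
wide_values = np.arange(36, dtype=float).reshape(12, 3)

X, Y = make_windows(wide_values, n_lags=3)
print(X.shape)  # (samples, timesteps, features)
print(Y.shape)  # (samples, features)
```

The (samples, timesteps, features) layout is what recurrent layers in Keras expect. Applying the function separately to df_train.values and df_test.values keeps training windows from reaching into the test period.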