I'm pretty new to time series.
This is the dataset I'm working on:
Date Price Location
0 2012-01-01 1771.0 Marche
1 2012-01-01 1039.0 Calabria
2 2012-01-01 2193.0 Campania
3 2012-01-01 2015.0 Emilia-Romagna
4 2012-01-01 1483.0 Friuli-Venezia Giulia
... ... ... ...
2475 2022-04-01 1963.0 Lazio
2476 2022-04-01 1362.0 Friuli-Venezia Giulia
2477 2022-04-01 1674.0 Emilia-Romagna
2478 2022-04-01 1388.0 Marche
2479 2022-04-01 1103.0 Abruzzo
I'm trying to build an LSTM for price prediction, but I don't know how to manage the Location categorical feature: do I have to use one-hot encoding or a groupby?
What I want to predict is the price based on the location.
How can I achieve that? A Python solution is particularly appreciated.
Thanks in advance.
Suppose my dataset (df) is analogous to yours:
Date Price Location
0 2021-01-01 791.076890 Campania
1 2021-01-01 705.702464 Lombardia
2 2021-01-01 719.991382 Sicilia
3 2021-02-01 825.760917 Lombardia
4 2021-02-01 747.734309 Sicilia
... ... ... ...
31 2021-11-01 886.874348 Lombardia
32 2021-11-01 935.040583 Campania
33 2021-12-01 771.165378 Sicilia
34 2021-12-01 952.255227 Campania
35 2021-12-01 939.754515 Lombardia
In my case I have a Price record for 3 regions (Campania, Lombardia, Sicilia) every month. My idea is to treat the different regions as different features, so I would transform df as:
df = df.set_index(["Date", "Location"]).Price.unstack()
Now my dataset is like:
Location Campania Lombardia Sicilia
Date
2021-01-01 791.076890 705.702464 719.991382
2021-02-01 758.872755 825.760917 747.734309
2021-03-01 880.038005 803.165998 837.738419
... ... ... ...
2021-10-01 908.402345 805.081193 792.369610
2021-11-01 935.040583 886.874348 736.862025
2021-12-01 952.255227 939.754515 771.165378
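One caveat with the reshaping step: unstack() raises a ValueError if the same (Date, Location) pair occurs more than once. Below is a minimal sketch of a fallback, assuming it is acceptable to average such duplicates. The names long_df and wide_df and all the values are made up for illustration.

```python
import pandas as pd

# Toy long-format data (hypothetical values); note the duplicated
# (2021-01-01, Lombardia) pair, which would make unstack() fail.
long_df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-01-01", "2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01",
    ]),
    "Price": [791.0, 705.0, 709.0, 825.0, 747.0],
    "Location": ["Campania", "Lombardia", "Lombardia", "Lombardia", "Sicilia"],
})

# pivot_table aggregates duplicates instead of raising; averaging them is
# an assumption here, pick whatever aggregation suits your data.
wide_df = long_df.pivot_table(
    index="Date", columns="Location", values="Price", aggfunc="mean"
)
print(wide_df)
```

If your data has exactly one row per date and region, the plain set_index/unstack version gives the same result.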
After this, make sure there are no NaN values (df.isna().sum()). A NaN appears wherever a region has no record for a given date.
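If the check does report missing values, one possible way to deal with them is to interpolate within each region's column. This is only a sketch: linear interpolation is an assumption on my part (forward-filling or dropping incomplete dates are equally valid choices), and the frame wide_df with its values is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data with a few gaps, one column per region.
dates = pd.date_range("2021-01-01", periods=6, freq="MS")
wide_df = pd.DataFrame(
    {
        "Campania": [791.0, 758.0, np.nan, 880.0, 908.0, 935.0],
        "Lombardia": [705.0, 825.0, 803.0, np.nan, np.nan, 886.0],
        "Sicilia": [np.nan, 747.0, 837.0, 792.0, 736.0, 771.0],
    },
    index=dates,
)

# Count the missing values per region before filling.
missing_before = wide_df.isna().sum()

# Linear interpolation along time; limit_direction="both" also fills
# gaps at the very start or end with the nearest valid value.
filled_df = wide_df.interpolate(method="linear", limit_direction="both")

# Count again to confirm nothing is left.
missing_after = filled_df.isna().sum()
print(missing_before, missing_after, sep="\n")
```

Keep in mind that interpolation uses later observations to fill earlier gaps, so with many gaps you may prefer a forward-fill, which only looks backwards.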
Now you can pass this data to a multi-feature RNN (or LSTM), as in this example, or to a multi-channel 1D-CNN (choosing an appropriate kernel size). The only problem in both cases could be the small size of the dataset, so try not to over-parameterize the model (for example, reduce the number of neurons and layers); otherwise overfitting will be unavoidable. To check for this, you can test the model on the last 20% of your time series:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, shuffle=False, test_size=.2)
The last part is to build matching (X, Y) pairs for supervised learning, but this depends on which model you are using and what your prediction task is. Another example here.
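To make the (X, Y) step more concrete, here is a minimal sketch for one common task: predicting next month's prices for all regions from the previous n_steps months. The helper make_windows, the window length of 3 and the random data are all hypothetical and only there for illustration; a different task (a longer horizon, a single target region) needs a different construction.

```python
import numpy as np
import pandas as pd


def make_windows(values, n_steps):
    """Slice a (time, features) array into next-step (X, Y) pairs."""
    X, Y = [], []
    for start in range(len(values) - n_steps):
        X.append(values[start:start + n_steps])  # n_steps consecutive months
        Y.append(values[start + n_steps])        # the month right after them
    return np.array(X), np.array(Y)


# Synthetic stand-in for the wide frame (random values, not real prices).
dates = pd.date_range("2021-01-01", periods=12, freq="MS")
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.uniform(700, 950, size=(12, 3)),
    index=dates,
    columns=["Campania", "Lombardia", "Sicilia"],
)

n_steps = 3  # assumed look-back window, tune it for your data
X, Y = make_windows(wide_df.to_numpy(), n_steps)
print(X.shape, Y.shape)  # (9, 3, 3) (9, 3)
```

X has the (samples, time steps, features) layout that recurrent layers usually expect. Build the windows separately on df_train and df_test so that no window straddles the split.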