
How to do hyper-parameter tuning with panel data in sklearn framework?


Imagine we have multiple time-series observations for multiple entities, and we want to perform hyper-parameter tuning on a single model, splitting the data in a time-series cross-validation fashion.

To my knowledge, there isn't a straightforward way to perform this kind of hyper-parameter tuning within the scikit-learn framework. TimeSeriesSplit handles a single time series, but it splits on row position, so it does not respect periods when several entities are stacked in one dataframe.
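
To make the problem concrete, here is a minimal sketch that applies TimeSeriesSplit directly to the stacked rows of a toy panel. The `panel` frame and the variable names are illustrative assumptions, not part of scikit-learn.

```python
import pandas as pd
from itertools import product
from sklearn.model_selection import TimeSeriesSplit

# Toy panel: 2 countries x 10 periods, rows sorted by country, then period.
panel = pd.DataFrame(list(product(['ESP', 'FRA'], range(10))),
                     columns=['country', 'period'])

# TimeSeriesSplit only sees row positions; it knows nothing about 'period'.
tss = TimeSeriesSplit(n_splits=3)
row_folds = list(tss.split(panel))

for train_idx, test_idx in row_folds:
    train, test = panel.iloc[train_idx], panel.iloc[test_idx]
    print("train countries:", sorted(train['country'].unique()),
          "| max train period:", train['period'].max(),
          "| test countries:", sorted(test['country'].unique()),
          "| test periods:", test['period'].tolist())
```

With the rows sorted by country, the second fold trains on every ESP period and tests on FRA periods 0 to 4, so the test periods are earlier than the latest training period.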

As a simple example imagine we have a dataframe:

from itertools import product

import numpy as np
import pandas as pd

# create a dataframe
countries = ['ESP','FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries,periods)), columns = ['country','period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)  # one random value per row

# this produces a dataframe like the following (a_feature is random):
country,period,target,a_feature
ESP,0,1,0.08
ESP,1,1,-2.0
ESP,2,1,0.1
ESP,3,1,-0.59
ESP,4,1,-0.83
ESP,5,1,0.05
ESP,6,1,0.05
ESP,7,1,0.42
ESP,8,1,0.04
ESP,9,1,2.17
FRA,0,0,-0.44
FRA,1,0,-0.48
FRA,2,0,0.82
FRA,3,0,-1.64
FRA,4,0,0.19
FRA,5,0,0.6
FRA,6,0,-0.73
FRA,7,0,-0.5
FRA,8,0,1.11
FRA,9,0,-0.75

We want to train a single model across Spain and France: take all the data up to a certain period, then use that trained model to predict the next period for both countries. We also want to assess which set of hyper-parameters performs best.
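
The folds described above can be written out by hand for the toy data. This is only a sketch of the desired splitting behaviour; `expanding_period_folds` and `first_test_period` are hypothetical names introduced for illustration.

```python
import pandas as pd
from itertools import product

panel = pd.DataFrame(list(product(['ESP', 'FRA'], range(10))),
                     columns=['country', 'period'])

def expanding_period_folds(period_col, first_test_period):
    """Yield (train_positions, test_positions), one fold per test period.

    Each fold trains on every row with an earlier period and tests on
    all rows (i.e. all entities) of the test period itself.
    """
    for t in sorted(period_col.unique()):
        if t < first_test_period:
            continue
        train_mask = (period_col < t).to_numpy()
        test_mask = (period_col == t).to_numpy()
        yield train_mask.nonzero()[0], test_mask.nonzero()[0]

folds = list(expanding_period_folds(panel['period'], first_test_period=7))
for train_pos, test_pos in folds:
    print("train periods up to", panel['period'].iloc[train_pos].max(),
          "| test rows:", panel.iloc[test_pos][['country', 'period']].values.tolist())
```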

How to do hyper-parameter tuning with panel data in a time-series cross-validation framework?

Similar questions have been asked here:


Solution

  • PanelSplit

    I propose PanelSplit, a custom cross-validator for panel data. It's essentially a wrapper for TimeSeriesSplit, taking similar arguments but adding panel-data functionality.

    PanelSplit works essentially as follows:

    1. Create train and test indices for each fold by passing the period series to TimeSeriesSplit
    2. For the train and test sets of each fold, substitute the indices with the corresponding period values
    3. For the train and test periods of each fold, find the rows of the panel data whose period is among them and return those rows' indices.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit
    
    class PanelSplit:
        def __init__(self, unique_periods, train_periods, n_splits=5, gap=0, test_size=None, max_train_size=None):
            """
            Time-series cross-validator for panel data, with train/test splits based on unique periods.
    
            Parameters:
            - unique_periods: Pandas DataFrame or Series containing the unique periods, in chronological order
            - train_periods: Pandas Series holding the period of every row in the panel
            - n_splits: Number of splits for TimeSeriesSplit
            - gap: Number of periods skipped between the train and test sets in TimeSeriesSplit
            - test_size: Number of periods in each test set in TimeSeriesSplit
            - max_train_size: Maximum number of periods in a single training set
            """
            self.tss = TimeSeriesSplit(n_splits=n_splits, gap=gap, test_size=test_size, max_train_size = max_train_size)
            indices = self.tss.split(unique_periods)
            self.u_periods_cv = self.split_unique_periods(indices, unique_periods)
            self.all_periods = train_periods
            self.n_splits = n_splits
            
        def split_unique_periods(self, indices, unique_periods):
            """
            Split unique periods into train/test sets based on TimeSeriesSplit indices.
    
            Parameters:
            - indices: TimeSeriesSplit indices
            - unique_periods: Pandas DataFrame or Series containing unique periods
    
            Returns: List of tuples containing train and test periods
            """
            u_periods_cv = []
            for i, (train_index, test_index) in enumerate(indices):
                unique_train_periods = unique_periods.iloc[train_index].values
                unique_test_periods = unique_periods.iloc[test_index].values
                u_periods_cv.append((unique_train_periods, unique_test_periods))
            return u_periods_cv
    
        def split(self, X = None, y = None, groups=None):
            """
            Generate train/test indices based on unique periods.
            """
            self.all_indices = []
            
            for train_periods, test_periods in self.u_periods_cv:
                # Return row positions rather than index labels:
                # scikit-learn indexes X and y positionally.
                train_indices = np.flatnonzero(self.all_periods.isin(train_periods).to_numpy())
                test_indices = np.flatnonzero(self.all_periods.isin(test_periods).to_numpy())
                self.all_indices.append((train_indices, test_indices))
            
            return self.all_indices
       
        def get_n_splits(self, X=None, y =None, groups=None):
            """
            Returns: Number of splits
            """
            return self.n_splits
    
    

    Hyper-parameter tuning with PanelSplit

    Here is a demo of how it can be used as a cross-validator for hyperparameter tuning.

    Before doing hyperparameter tuning in a real setting, I reset indices and drop NaN values with respect to both feature variables and the target. This usually saves me from indexing errors.
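
As a rough sketch of that clean-up step: the `raw` frame below is made-up data with a gap in the feature, a gap in the target, and a non-default index. The column names follow the toy example, and `clean` is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

# Hypothetical raw panel: NaNs in feature/target and an index that
# does not start at 0.
raw = pd.DataFrame({
    'country': ['ESP', 'ESP', 'ESP', 'FRA', 'FRA', 'FRA'],
    'period': [0, 1, 2, 0, 1, 2],
    'a_feature': [0.1, np.nan, 0.3, -0.2, 0.5, 0.0],
    'target': [1, 1, 1, 0, 0, np.nan],
}, index=[10, 11, 12, 13, 14, 15])

# Drop rows missing either the feature or the target, then rebuild a
# 0..n-1 index so that index labels coincide with row positions.
clean = (raw.dropna(subset=['a_feature', 'target'])
            .reset_index(drop=True))

print(clean)
```

After this, index labels and row positions agree, which matters because scikit-learn's cross-validation machinery selects rows by position.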

    from itertools import product
    
    # create a dataframe
    countries = ['ESP','FRA']
    periods = list(range(10))
    df = pd.DataFrame(list(product(countries,periods)), columns=['country','period'])
    df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
    df['a_feature'] = np.random.randn(20)  # one random value per row
    
    unique_periods = pd.Series(df.period.unique())
    panel_split = PanelSplit(n_splits=3,
                             unique_periods=unique_periods,
                             train_periods=df.period)
    
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    
    param_grid = {'max_depth': [2, 3]}
    
    param_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=panel_split)
    param_search.fit(df[['a_feature']], df['target'])
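
Once the search has been fitted, the results can be read from `best_params_` and `cv_results_`. The sketch below is self-contained, so instead of the PanelSplit class it builds the same kind of period-based folds as a plain list and passes that list as `cv` (GridSearchCV also accepts an iterable of train/test index arrays). The seed, `n_estimators=20`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
panel = pd.DataFrame(list(product(['ESP', 'FRA'], range(10))),
                     columns=['country', 'period'])
panel['target'] = np.repeat([1, 0], 10)
panel['a_feature'] = rng.standard_normal(len(panel))

# Split the unique periods, then map each fold back to row positions.
unique_periods = np.sort(panel['period'].unique())
splits = []
for train_p, test_p in TimeSeriesSplit(n_splits=3).split(unique_periods):
    in_train = panel['period'].isin(unique_periods[train_p]).to_numpy()
    in_test = panel['period'].isin(unique_periods[test_p]).to_numpy()
    splits.append((np.flatnonzero(in_train), np.flatnonzero(in_test)))

search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      {'max_depth': [2, 3]}, cv=splits)
search.fit(panel[['a_feature']], panel['target'])

# Best setting and a per-candidate summary of the cross-validated scores.
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)
print(results[['param_max_depth', 'mean_test_score', 'rank_test_score']])
```

Since the feature here is pure noise, the scores themselves carry no meaning; the point is only where to look for them.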