machine-learning scikit-learn time-series classification grid-search

How to train with TimeSeriesSplit from sklearn?

I have this kind of data (columns):

| year-month | client_id | Y | X1.. Xn |

Here Y indicates whether client client_id purchased the product in a given year-month, and X1..Xn are the explanatory variables. I have two years of monthly data, and I have already split it with TimeSeriesSplit() as described in this answer. The problem now is that I want to run GridSearchCV() on that split, trying different models (RandomForestClassifier(), XGBClassifier(), LGBMClassifier(), etc.) with different hyperparameters, but I can't figure out how to use GridSearchCV() with the split I already have.

Any suggestions?

Solution

Assuming you have a splits DataFrame built as in this question, first save the indices of each fold as a list of (train, test) tuples, i.e.:

 [(train_indices, test_indices), # 1st fold
  (train_indices, test_indices)] # 2nd fold, etc.

The following code will do this:

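import numpy as np

custom_cv = []

# GridSearchCV treats these arrays as positional row indices,
# so df is assumed to have a default RangeIndex (0..n-1).
for fold_train, fold_test in zip(splits['train'], splits['test']):
    custom_cv.append((fold_train.index.to_numpy(), fold_test.index.to_numpy()))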
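
If the splits frame from the linked question is not at hand, the same list of (train, test) index pairs can be built directly from the month column. The following is only a sketch under assumptions: the toy frame `df_demo`, its `year_month` column name, and `n_splits=3` are all hypothetical, and the frame is assumed to have a default RangeIndex so that positions and index labels coincide. It splits the unique months with TimeSeriesSplit and then maps each month fold back to row positions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 6 months x 3 clients, rows ordered by month.
month_labels = [f"2020-{m:02d}" for m in range(1, 7)]
df_demo = pd.DataFrame({
    "year_month": np.repeat(month_labels, 3),
    "client_id": np.tile([1, 2, 3], 6),
})

# Split on the sorted unique months, not on the rows, so that all
# clients of a given month land in the same fold.
unique_months = np.array(sorted(df_demo["year_month"].unique()))

demo_cv = []
for train_m, test_m in TimeSeriesSplit(n_splits=3).split(unique_months):
    train_mask = df_demo["year_month"].isin(unique_months[train_m]).to_numpy()
    test_mask = df_demo["year_month"].isin(unique_months[test_m]).to_numpy()
    # Convert boolean masks to positional row indices.
    demo_cv.append((np.flatnonzero(train_mask), np.flatnonzero(test_mask)))

for train_pos, test_pos in demo_cv:
    print(len(train_pos), "train rows,", len(test_pos), "test rows")
```

Each fold trains only on months that precede its test month, which is the property a time-ordered split is meant to preserve.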

You can then use GridSearchCV() as follows.

Here we create one dictionary with the classifiers and another with the parameter grid for each of them.

This is just a sample; make sure to limit the search space when testing:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier  # Y is a binary label, so use the classifier, not XGBRegressor

dict_classifiers = {
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Linear SVM": SVC(),  # the kernel is chosen by the grid below
    "XGB": XGBClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}

params = {
    "Random Forest": {"max_depth": range(5, 30, 5), "min_samples_leaf": range(1, 30, 2),
                      "n_estimators": range(100, 2000, 200)},

    "Gradient Boosting Classifier": {"learning_rate": [0.001, 0.01, 0.1], "n_estimators": range(1000, 3000, 200)},
    "Linear SVM": {"kernel": ["rbf", "poly"], "gamma": ["auto", "scale"], "degree": range(1, 6, 1)},
    "XGB": {'min_child_weight': [1, 5, 10],
            'gamma': [0.5, 1, 1.5, 2, 5],
            'subsample': [0.6, 0.8, 1.0],
            'colsample_bytree': [0.6, 0.8, 1.0],
            'max_depth': [3, 4, 5], "n_estimators": [300, 600],
            "learning_rate": [0.001, 0.01, 0.1],
            },
    # penalty=None requires scikit-learn >= 1.2; older versions expect the string 'none'
    "Logistic Regression": {'penalty': [None, 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    "Nearest Neighbors": {'n_neighbors': [3, 5, 11, 19], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']},
    "Decision Tree": {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15)},

}
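
Before launching a search like this, it can help to count how many fits a grid implies. The following is a small sketch using scikit-learn's ParameterGrid on a copy of the Random Forest grid from above; the fold count `n_folds = 3` is an assumed value for illustration, so substitute `len(custom_cv)` in practice.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the Random Forest grid from the params dictionary.
rf_grid = {
    "max_depth": list(range(5, 30, 5)),
    "min_samples_leaf": list(range(1, 30, 2)),
    "n_estimators": list(range(100, 2000, 200)),
}

n_candidates = len(ParameterGrid(rf_grid))  # number of parameter combinations
n_folds = 3  # assumed number of folds, i.e. len(custom_cv)
n_fits = n_candidates * n_folds  # fits during the search, excluding the final refit

print(n_candidates, "candidates,", n_fits, "fits")
```

With several hundred candidates per model, trimming each grid to a few values per parameter is usually the quickest way to get a first result.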


for classifier_name in dict_classifiers.keys() & params:

    print("training: ", classifier_name)
    gridSearch = GridSearchCV(
        estimator=dict_classifiers[classifier_name], param_grid=params[classifier_name], cv=custom_cv)
    gridSearch.fit(df[['X']].to_numpy(),  # should have shape (n_samples, n_features)
                   df['Y'].to_numpy())  # should be a 1-D array with shape (n_samples,)
    print(gridSearch.best_score_, gridSearch.best_params_)

To pass all columns whose names start with 'X' (e.g., 'X1', 'X2', ...) as the training features, replace ['X'] in gridSearch.fit with df.columns[df.columns.str.startswith('X')].
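
As a quick check of that column selection, here is a minimal sketch on a made-up frame. The name `toy_df` and its values are hypothetical; only the 'X' prefix convention comes from the question.

```python
import pandas as pd

# Hypothetical frame with the column layout described in the question.
toy_df = pd.DataFrame({
    "client_id": [1, 2],
    "Y": [0, 1],
    "X1": [0.1, 0.2],
    "X2": [3.0, 4.0],
})

# Select every column whose name starts with 'X'.
feature_cols = toy_df.columns[toy_df.columns.str.startswith("X")]

X = toy_df[feature_cols].to_numpy()  # shape (n_samples, n_features)
y = toy_df["Y"].to_numpy()           # shape (n_samples,)

print(list(feature_cols), X.shape, y.shape)
```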