machine-learning, scikit-learn, time-series, classification, grid-search

How to train with TimeSeriesSplit from sklearn?


I have this kind of data (columns):

| year-month | client_id | Y | X1.. Xn |

Y indicates whether client client_id purchased the product in the given year-month, and X1..Xn are the explanatory variables. I have two years of monthly data, and I have already split it with TimeSeriesSplit() as described in this answer. Now I want to run GridSearchCV() on that split, trying different models (RF, XGBoostClassifier(), LightGBM(), etc.) with different hyperparameters, but I can't figure out how to use GridSearchCV() with the split I already have.

Any suggestions?


Solution

  • Assuming you have a splits df based on this question, first save the indices for each fold into a list of (train, test) tuples, i.e.:

     [(train_indices, test_indices), # 1st fold
      (train_indices, test_indices)] # 2nd fold etc.
    

    The following code will do this:

    import numpy as np

    custom_cv = []

    # GridSearchCV reads these as positional row indices, so df needs a default 0..n-1 index
    for fold_train, fold_test in zip(splits['train'], splits['test']):
        custom_cv.append((fold_train.index.to_numpy(), fold_test.index.to_numpy()))
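GridSearchCV treats each array in cv as positional row indices into the data passed to fit. With several clients per month, one way to get such folds is to split the unique months with TimeSeriesSplit and then map each month back to its rows. This is a minimal sketch of that idea, not the splits df from the linked question; the column name year_month and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy panel: 6 months x 3 clients (hypothetical data and column names).
month_labels = [f"2020-{m:02d}" for m in range(1, 7)]
df = pd.DataFrame({
    "year_month": np.repeat(month_labels, 3),
    "client_id": np.tile([1, 2, 3], 6),
})

# "YYYY-MM" strings sort chronologically, so sorted() gives time order.
months = sorted(df["year_month"].unique())
# Integer code of the month each row belongs to (0 = earliest month).
month_code = df["year_month"].map({m: i for i, m in enumerate(months)}).to_numpy()

custom_cv = []
# Split the months (not the rows), so a month never straddles train and test.
for train_m, test_m in TimeSeriesSplit(n_splits=3).split(np.arange(len(months))):
    train_pos = np.flatnonzero(np.isin(month_code, train_m))  # positional row indices
    test_pos = np.flatnonzero(np.isin(month_code, test_m))
    custom_cv.append((train_pos, test_pos))
```

Because the folds are built from positions rather than index labels, they stay valid even if df has a non-default index.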
    

    You can then use GridSearchCV() as follows.

    Here we create one dictionary of classifiers and another with the parameter grid for each.

    This is only a sample; limit the search space when testing, since these grids are large:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier  # Y is a binary label, so use the classifier, not XGBRegressor

    dict_classifiers = {
        "Random Forest": RandomForestClassifier(),
        "Gradient Boosting Classifier": GradientBoostingClassifier(),
        "Linear SVM": SVC(),
        "XGB": XGBClassifier(),
        "Logistic Regression": LogisticRegression(),
        "Nearest Neighbors": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(),
    }
    
    params = {
        "Random Forest": {"max_depth": range(5, 30, 5), "min_samples_leaf": range(1, 30, 2),
                          "n_estimators": range(100, 2000, 200)},
    
        "Gradient Boosting Classifier": {"learning_rate": [0.001, 0.01, 0.1], "n_estimators": range(1000, 3000, 200)},
        "Linear SVM": {"kernel": ["rbf", "poly"], "gamma": ["auto", "scale"], "degree": range(1, 6, 1)},
        "XGB": {'min_child_weight': [1, 5, 10],
                'gamma': [0.5, 1, 1.5, 2, 5],
                'subsample': [0.6, 0.8, 1.0],
                'colsample_bytree': [0.6, 0.8, 1.0],
                'max_depth': [3, 4, 5], "n_estimators": [300, 600],
                "learning_rate": [0.001, 0.01, 0.1],
                },
        # scikit-learn >= 1.2 expects None for no penalty; older versions use the string 'none'
        "Logistic Regression": {'penalty': [None, 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
        "Nearest Neighbors": {'n_neighbors': [3, 5, 11, 19], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan']},
        "Decision Tree": {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15)},
    
    }
    
    
    for classifier_name in dict_classifiers.keys() & params:
    
        print("training: ", classifier_name)
        gridSearch = GridSearchCV(
            estimator=dict_classifiers[classifier_name], param_grid=params[classifier_name], cv=custom_cv)
        gridSearch.fit(df[['X']].to_numpy(),  # should have shape (n_samples, n_features)
                       df['Y'].to_numpy())    # should have shape (n_samples,)
        print(gridSearch.best_score_, gridSearch.best_params_)
    

    To pass all columns whose names start with 'X' (e.g., 'X1', 'X2', ...) as features, replace ['X'] in gridSearch.fit with df.columns[df.columns.str.startswith('X')].
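For a quick check that the pieces fit together, here is a small end-to-end sketch on synthetic data. Everything in it is an assumption for illustration: the data is random, the two hand-written folds stand in for custom_cv, and the grids are deliberately tiny so the search finishes in seconds.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real table; rows are assumed to be in time order.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
df["Y"] = (df["X1"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Two expanding-window folds given as positional row indices.
custom_cv = [
    (np.arange(0, 60), np.arange(60, 90)),
    (np.arange(0, 90), np.arange(90, 120)),
]

# Select every column whose name starts with "X" as a feature.
feature_cols = df.columns[df.columns.str.startswith("X")]
X = df[feature_cols].to_numpy()  # shape (n_samples, n_features)
y = df["Y"].to_numpy()           # shape (n_samples,)

# Each entry pairs an estimator with a small parameter grid.
models = {
    "Random Forest": (RandomForestClassifier(random_state=0),
                      {"max_depth": [3, 5], "n_estimators": [50]}),
    "Logistic Regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
}

results = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=custom_cv)  # default scoring: accuracy
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)
    print(name, search.best_score_, search.best_params_)
```

Keeping each estimator next to its grid in one dictionary avoids the two dictionaries drifting out of sync, and iterating over a single dict gives a deterministic model order.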