Tags: python, machine-learning, scikit-learn, cross-validation, grid-search

Combining RandomizedSearchCV (or GridSearchCV) with LeaveOneGroupOut cross-validation in scikit-learn


I like using scikit-learn's LOGO (LeaveOneGroupOut) as a cross-validation method, in combination with learning curves. This works really nicely in most of the cases I deal with, but I can only (efficiently) use the two parameters that, from experience, I believe are most critical in those cases: max_features and n_estimators. An example of my code is below:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, make_scorer
    from sklearn.model_selection import LeaveOneGroupOut, validation_curve

    Fscorer = make_scorer(f1_score, average='micro')
    gp = training_data["GP"].values
    logo = LeaveOneGroupOut()

    RF_clf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=49)
    RF_clf200 = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=49)
    RF_clf300 = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=49)
    RF_clf400 = RandomForestClassifier(n_estimators=400, n_jobs=-1, random_state=49)
    RF_clf500 = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=49)
    RF_clf600 = RandomForestClassifier(n_estimators=600, n_jobs=-1, random_state=49)

    param_name = "max_features"
    param_range = [5, 10, 15, 20, 25, 30]


    plt.figure()
    plt.suptitle('n_estimators = 100', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf100, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(param_range, test_scores_mean)
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()


    plt.figure()
    plt.suptitle('n_estimators = 200', fontsize=14, fontweight='bold')
    _, test_scores = validation_curve(RF_clf200, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(param_range, test_scores_mean)
    plt.xlabel(param_name)
    plt.xlim(min(param_range), max(param_range))
    plt.ylabel("F1")
    plt.ylim(0.47, 0.57)
    plt.legend(loc="best")
    plt.show()
    ...
    ...

What I would really like, though, is to combine LOGO with grid search or randomized search, for a more thorough search of the parameter space.

As of now my code looks like this:

    from scipy.stats import randint as sp_randint
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {"n_estimators": [100, 200, 300, 400, 500, 600],
                  "max_features": sp_randint(5, 30),
                  "max_depth": sp_randint(2, 18),
                  "criterion": ['entropy', 'gini'],
                  "min_samples_leaf": sp_randint(2, 17)}

    clf = RandomForestClassifier(random_state=49)

    n_iter_search = 45
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                       n_iter=n_iter_search,
                                       scoring=Fscorer, cv=8,
                                       n_jobs=-1)
    random_search.fit(X, y)

When I replace cv=8 with cv=logo.split(X, y, groups=gp), I get this error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-0092e11ffbf4> in <module>()
---> 35 random_search.fit(X, y)


/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in fit(self, X, y, groups)
   1183                                           self.n_iter,
   1184                                           random_state=self.random_state)
-> 1185         return self._fit(X, y, groups, sampled_params)

/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in _fit(self, X, y, groups, parameter_iterable)
    540 
    541         X, y, groups = indexable(X, y, groups)
--> 542         n_splits = cv.get_n_splits(X, y, groups)
    543         if self.verbose > 0 and isinstance(parameter_iterable, Sized):
    544             n_candidates = len(parameter_iterable)

/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in get_n_splits(self, X, y, groups)
   1489             Returns the number of splitting iterations in the cross-validator.
   1490         """
-> 1491         return len(self.cv)  # Both iterables and old-cv objects support len
   1492 
   1493     def split(self, X=None, y=None, groups=None):

TypeError: object of type 'generator' has no len()

Any suggestions as to (1) what is happening and, more importantly, (2) how I can make it work (combining RandomizedSearchCV with LeaveOneGroupOut)?

*UPDATE Feb. 08 2017*

It worked using cv=logo together with @Vivek Kumar's suggestion of random_search.fit(X, y, wells)


Solution

  • You should not pass logo.split() into RandomizedSearchCV; only pass a cv object like logo into it. RandomizedSearchCV internally calls split() to generate the train/test indices. You can pass your gp groups into the fit() call of the RandomizedSearchCV or GridSearchCV object.

    Instead of doing this:

    random_search.fit(X, y)
    

    Do this:

    random_search.fit(X, y, gp)
    

    EDIT: You can also pass gp to the constructor of GridSearchCV or RandomizedSearchCV in the fit_params parameter, as a dict.
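
Putting the pieces together: the traceback happens because logo.split(...) returns a generator, which has no len(), whereas the search expects either an integer or a CV splitter object. Below is a minimal sketch of the working setup; it assumes X, y, and the gp group labels are defined as in the question, and it shortens the parameter distributions for brevity:

    from scipy.stats import randint as sp_randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, make_scorer
    from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

    Fscorer = make_scorer(f1_score, average='micro')
    logo = LeaveOneGroupOut()                       # pass the splitter itself as cv

    param_dist = {"n_estimators": [100, 200, 300, 400, 500, 600],
                  "max_features": sp_randint(5, 30)}

    random_search = RandomizedSearchCV(RandomForestClassifier(random_state=49),
                                       param_distributions=param_dist,
                                       n_iter=45, scoring=Fscorer,
                                       cv=logo, n_jobs=-1)

    random_search.fit(X, y, groups=gp)              # group labels go to fit()
    print(random_search.best_params_, random_search.best_score_)

Because logo (not logo.split(...)) is passed as cv, the search object calls split(X, y, groups) itself for every candidate, so the group labels only need to be supplied once, at fit time.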