I like using scikit-learn's LeaveOneGroupOut (LOGO) as a cross-validation method, in combination with learning curves. This works really nicely in most of the cases I deal with, but I am only able to (efficiently) use the two parameters that, from experience, I believe are most critical in those cases: max_features and n_estimators. Example of my code below:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve
from sklearn.ensemble import RandomForestClassifier

Fscorer = make_scorer(f1_score, average='micro')
gp = training_data["GP"].values   # group labels used by LeaveOneGroupOut
logo = LeaveOneGroupOut()

RF_clf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=49)
RF_clf200 = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=49)
RF_clf300 = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=49)
RF_clf400 = RandomForestClassifier(n_estimators=400, n_jobs=-1, random_state=49)
RF_clf500 = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=49)
RF_clf600 = RandomForestClassifier(n_estimators=600, n_jobs=-1, random_state=49)

param_name = "max_features"
param_range = [5, 10, 15, 20, 25, 30]
plt.figure()
plt.suptitle('n_estimators = 100', fontsize=14, fontweight='bold')
_, test_scores = validation_curve(RF_clf100, X, y, cv=logo.split(X, y, groups=gp),
                                  param_name=param_name, param_range=param_range,
                                  scoring=Fscorer, n_jobs=-1)
test_scores_mean = np.mean(test_scores, axis=1)
plt.plot(param_range, test_scores_mean, label="mean test F1")
plt.xlabel(param_name)
plt.xlim(min(param_range), max(param_range))
plt.ylabel("F1")
plt.ylim(0.47, 0.57)
plt.legend(loc="best")
plt.show()
plt.figure()
plt.suptitle('n_estimators = 200', fontsize=14, fontweight='bold')
_, test_scores = validation_curve(RF_clf200, X, y, cv=logo.split(X, y, groups=gp),
                                  param_name=param_name, param_range=param_range,
                                  scoring=Fscorer, n_jobs=-1)
test_scores_mean = np.mean(test_scores, axis=1)
plt.plot(param_range, test_scores_mean, label="mean test F1")
plt.xlabel(param_name)
plt.xlim(min(param_range), max(param_range))
plt.ylabel("F1")
plt.ylim(0.47, 0.57)
plt.legend(loc="best")
plt.show()
...
...
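The blocks above repeat in the same way for the remaining classifiers. As a sketch only (same X, y, gp, Fscorer and imports as above), the repetition could be folded into a single loop:

for n_est in [100, 200, 300, 400, 500, 600]:
    clf = RandomForestClassifier(n_estimators=n_est, n_jobs=-1, random_state=49)
    # fresh generator of LOGO splits for each classifier
    _, test_scores = validation_curve(clf, X, y, cv=logo.split(X, y, groups=gp),
                                      param_name=param_name, param_range=param_range,
                                      scoring=Fscorer, n_jobs=-1)
    plt.figure()
    plt.suptitle('n_estimators = %d' % n_est, fontsize=14, fontweight='bold')
    plt.plot(param_range, np.mean(test_scores, axis=1), label="mean test F1")
    plt.xlabel(param_name)
    plt.ylabel("F1")
    plt.legend(loc="best")
    plt.show()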
What I would really like though is to combine the LOGO with grid search, or randomized search, for a more thorough parameter space search.
As of now my code looks like this:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"n_estimators": [100, 200, 300, 400, 500, 600],
              "max_features": sp_randint(5, 30),
              "max_depth": sp_randint(2, 18),
              "criterion": ['entropy', 'gini'],
              "min_samples_leaf": sp_randint(2, 17)}

clf = RandomForestClassifier(random_state=49)
n_iter_search = 45
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search,
                                   scoring=Fscorer, cv=8,
                                   n_jobs=-1)
random_search.fit(X, y)
When I replace cv=8 with cv=logo.split(X, y, groups=gp), I get this error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-0092e11ffbf4> in <module>()
---> 35 random_search.fit(X, y)
/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in fit(self, X, y, groups)
1183 self.n_iter,
1184 random_state=self.random_state)
-> 1185 return self._fit(X, y, groups, sampled_params)
/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_search.pyc in _fit(self, X, y, groups, parameter_iterable)
540
541 X, y, groups = indexable(X, y, groups)
--> 542 n_splits = cv.get_n_splits(X, y, groups)
543 if self.verbose > 0 and isinstance(parameter_iterable, Sized):
544 n_candidates = len(parameter_iterable)
/Applications/anaconda/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in get_n_splits(self, X, y, groups)
1489 Returns the number of splitting iterations in the cross-validator.
1490 """
-> 1491 return len(self.cv) # Both iterables and old-cv objects support len
1492
1493 def split(self, X=None, y=None, groups=None):
TypeError: object of type 'generator' has no len()
Any suggestions as to (1) what is happening and, more importantly, (2) how I can make it work (combining RandomizedSearchCV with LeaveOneGroupOut)?
*UPDATE Feb. 08 2017*
It worked using cv=logo with @Vivek Kumar's suggestion of random_search.fit(X, y, wells).
You should not pass logo.split() into the RandomizedSearchCV; only pass a cv object like logo into it. The RandomizedSearchCV internally calls split() to generate the train/test indices. You can pass your gp groups into the fit() call to the RandomizedSearchCV or GridSearchCV object.
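To see what is happening with the original call: logo.split() returns a one-shot generator, and when a bare iterable is passed as cv, the search object wraps it and calls len() on it to count the splits (the last frame of your traceback), which a generator cannot provide, while the logo object itself can. A quick check, assuming X, y and gp as in the question:

splits = logo.split(X, y, groups=gp)
print(type(splits))                 # a generator - it has no len(), hence the TypeError
print(logo.get_n_splits(X, y, gp))  # the splitter object itself knows the number of splits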
Instead of doing this:
random_search.fit(X, y)
Do this:
random_search.fit(X, y, gp)
EDIT: You can also pass gp to the constructor of GridSearchCV or RandomizedSearchCV in the fit_params parameter, as a dict.
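For completeness, a minimal end-to-end sketch of the fix described above (cv=logo in the constructor, gp passed to fit()), assuming X, y, gp and Fscorer are defined as in the question:

from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, RandomizedSearchCV

logo = LeaveOneGroupOut()
param_dist = {"n_estimators": [100, 200, 300, 400, 500, 600],
              "max_features": sp_randint(5, 30),
              "max_depth": sp_randint(2, 18),
              "criterion": ['entropy', 'gini'],
              "min_samples_leaf": sp_randint(2, 17)}

random_search = RandomizedSearchCV(RandomForestClassifier(random_state=49),
                                   param_distributions=param_dist,
                                   n_iter=45,
                                   scoring=Fscorer,
                                   cv=logo,        # the splitter object, not logo.split(...)
                                   n_jobs=-1)
random_search.fit(X, y, gp)  # gp is picked up as the groups argument and forwarded to logo.split()
print(random_search.best_params_)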