I am using sklearn to run a grid search on a random forest classifier. It has been running longer than I expected, and I am trying to estimate how much time is left. I thought the total number of fits would be 3*3*3*3*5 = 405.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(n_jobs=-1, oob_score=True, verbose=1)
param_grid = {'n_estimators': [50, 200, 500],
              'max_depth': [2, 3, 5],
              'min_samples_leaf': [1, 2, 5],
              'max_features': ['auto', 'log2', 'sqrt']}
gscv = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
gscv.fit(X.values, y.values.reshape(-1,))
From the output, I see it cycling through batches of tasks, where the size of each batch equals the number of estimators:
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 5.3min
[Parallel(n_jobs=-1)]: Done 200 out of 200 tasks | elapsed: 6.2min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 3.0s
[Parallel(n_jobs=8)]: Done 200 tasks out of 200 tasks | elapsed: 3.2s finished
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 50 tasks out of 50 tasks | elapsed: 1.5min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 50 out of 50 tasks | elapsed: 0.8s finished
I counted the "finished" lines and the count is currently at 680. I expected it to be done at 405. Is my calculation wrong?
Your calculation seems correct: the number of grid points is the Cartesian product of the parameter value lists, which in this case is 81:
>>> from sklearn.model_selection import ParameterGrid
>>> pg = ParameterGrid(param_grid)
>>> len(pg)
81
For each of these, you have five cross-validation folds, for a total of 405 fits. The tasks count in the log is a separate indication entirely. The verbose argument gets passed through to a parent class, BaseForest, and subsequently to joblib's Parallel.
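The arithmetic can be sketched in plain Python, without sklearn. Two things here are assumptions rather than facts from the question: that the default refit=True is in effect (which adds one final fit on the full data), and that each forest fit emits two "finished" lines, which is only a reading of the alternating n_jobs=-1 / n_jobs=8 pairs in the posted log.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 200, 500],
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "log2", "sqrt"],
}
cv_folds = 5

# One candidate per element of the Cartesian product of the value lists.
n_candidates = sum(1 for _ in product(*param_grid.values()))
n_cv_fits = n_candidates * cv_folds

# Assumption: refit=True (the GridSearchCV default) adds one last fit.
n_total_fits = n_cv_fits + 1

# Assumption: two "finished" lines per forest fit, as the log pairs suggest.
finished_lines_seen = 680
lines_per_fit = 2
fits_done = finished_lines_seen // lines_per_fit

print(n_candidates, n_cv_fits, n_total_fits)
print(fits_done, n_cv_fits - fits_done)
```

Under those assumptions, 680 "finished" lines would correspond to roughly 340 of the 405 cross-validation fits, not to 680 fits.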
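I'm not precisely sure what constitutes a task in this case, though the task totals in your log (200 and 50) match values of n_estimators, which suggests one task per tree. If so, the count of "finished" lines need not equal the number of fits. Either way, the number of top-level grid-train combinations should be 405. Keep in mind each of these is in turn an ensemble of trees.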
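If the goal is to track progress in units of fits rather than trees, one option is to set verbose on GridSearchCV itself and leave it at its default on the forest. The sketch below uses a small synthetic dataset and a reduced grid (X_demo, y_demo and small_grid are made up for illustration, not taken from the question) so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real X and y (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# verbose stays at its default of 0 on the forest, so the per-tree
# joblib messages are not printed.
clf = RandomForestClassifier(random_state=0)
small_grid = {"n_estimators": [5, 10], "max_depth": [2, 3]}

# verbose on the search reports progress at the level of fits.
gscv = GridSearchCV(estimator=clf, param_grid=small_grid, cv=3, verbose=1)
gscv.fit(X_demo, y_demo)

# Candidates times folds gives the number of cross-validation fits.
n_candidates = len(gscv.cv_results_["params"])
print(n_candidates * gscv.n_splits_)
```

With verbose=1 the search prints a one-line summary of the number of folds, candidates, and total fits before it starts; higher values add per-fit messages, which makes it easier to judge how far along the search is.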