How many combinations will GridSearchCV run for this?


I am using sklearn to run a grid search on a random forest classifier. It has been running longer than I expected, and I am trying to estimate how much time is left. I expected the total number of fits to be 3*3*3*3*5 = 405.

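from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(n_jobs=-1, oob_score=True, verbose=1)
param_grid = {'n_estimators': [50, 200, 500],
              'max_depth': [2, 3, 5],
              'min_samples_leaf': [1, 2, 5],
              'max_features': ['auto', 'log2', 'sqrt']
              }

# 5-fold cross-validation over every combination in param_grid
gscv = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
gscv.fit(X.values, y.values.reshape(-1,))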

From the output, I see it cycling through the tasks where each set is the number of estimators:

[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 5.3min
[Parallel(n_jobs=-1)]: Done 200 out of 200 tasks | elapsed: 6.2min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 3.0s
[Parallel(n_jobs=8)]: Done 200 tasks out of 200 tasks | elapsed: 3.2s finished
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 50 tasks out of 50 tasks | elapsed: 1.5min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 50 out of 50 tasks | elapsed: 0.8s finished

I counted the "finished" lines and the total is currently 680. I expected it to be done at 405. Is my calculation wrong?


Solution

  • Your calculation is correct: the number of parameter combinations is the product of the number of values for each parameter, which in this case is 3*3*3*3 = 81:

    >>> from sklearn.model_selection import ParameterGrid
    
    >>> pg = ParameterGrid(param_grid)
    >>> len(pg)
    81
    

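    Within each, you have five cross-validation folds, for a total of 405 fits (plus one final refit of the best combination on the full data, since refit=True is the default). The task counts in the log are a separate indication entirely.

    verbose=1 on the classifier gets passed through to the parent class BaseForest, and subsequently to joblib's Parallel, so every forest prints its own progress lines each time it is fit or used for prediction. That is why the number of "finished" lines is not the number of grid search fits.

    I'm not precisely sure what constitutes a task in this case, but the totals in your log (200 and 50) match values of n_estimators, which suggests each task is a single tree. Either way, the number of top-level grid-train combinations should be 405. Keep in mind each of these is in turn an ensemble of trees.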
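
As a rough sanity check on the arithmetic, here is a minimal sketch that uses only the standard library. It assumes the param_grid from the question and cv=5; the names n_candidates and total_fits are just illustrative. It does not call scikit-learn, so it only reproduces the counting, not what GridSearchCV does internally.

```python
from itertools import product
from math import prod

# Parameter grid copied from the question.
param_grid = {
    'n_estimators': [50, 200, 500],
    'max_depth': [2, 3, 5],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['auto', 'log2', 'sqrt'],
}
cv = 5  # number of folds passed to GridSearchCV in the question

# One candidate per element of the Cartesian product of the value lists.
n_candidates = prod(len(values) for values in param_grid.values())

# Cross-check the product by actually enumerating the combinations.
assert n_candidates == len(list(product(*param_grid.values())))

# Each candidate is fit once per cross-validation fold.
total_fits = n_candidates * cv

print(n_candidates)  # 81
print(total_fits)    # 405
```

This count leaves out the single extra fit on the full training set that happens afterwards when refit=True. GridSearchCV also accepts its own verbose argument, which reports progress in terms of these fits rather than per-tree tasks, so it may be a better guide to the remaining time than the classifier's output.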