I'm training a sklearn.ensemble.RandomForestClassifier() on a single cluster node with 28 CPUs and ~190 GB RAM. Training this classifier alone is quite fast, uses all cores on the machine, and needs ~93 GB of RAM:
import sklearn.ensemble
import sklearn.model_selection

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x, y, test_size=0.25, random_state=0)
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=100,
                                              random_state=0,
                                              n_jobs=-1,
                                              max_depth=10,
                                              class_weight='balanced',
                                              warm_start=False,
                                              verbose=2)
clf.fit(x_train, y_train)
with output:
[Parallel(n_jobs=-1)]: Done 88 out of 100 | elapsed: 1.9min remaining: 15.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.0min finished
CPU times: user 43min 10s, sys: 1min 33s, total: 44min 44s
Wall time: 2min 20s
However, this particular model seems suboptimal, achieving only ~80% accuracy. So I want to optimize its hyperparameters using sklearn.model_selection.RandomizedSearchCV(). I set up the search like so:
rfc = sklearn.ensemble.RandomForestClassifier()
rf_random = sklearn.model_selection.RandomizedSearchCV(estimator=rfc,
                                                       param_distributions=random_grid,
                                                       n_iter=100,
                                                       cv=3,
                                                       verbose=2,
                                                       random_state=0,
                                                       n_jobs=2,
                                                       pre_dispatch=1)
rf_random.fit(x, y)
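For reference, random_grid is a dict mapping hyperparameter names to the candidate values to sample from. The actual grid isn't shown above; a typical one looks like this (illustrative values, not the ones from the original run):

```python
# Illustrative parameter distributions for RandomizedSearchCV
# (hypothetical values, not the original random_grid):
random_grid = {
    'n_estimators': [100, 200, 400, 800],
    'max_depth': [10, 20, 50, None],
    'max_features': ['sqrt', 'log2'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}
```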
But I cannot find settings for n_jobs and pre_dispatch that use the hardware effectively. Here are some example runs and their results:
n_jobs     pre_dispatch    Result
===========================================================================
default    default         Utilizes all cores, but job killed - out of memory
-1         1               Job killed - out of memory
12         1               Job killed - out of memory
3          1               Job killed - out of memory
2          1               Job runs, but only utilizes 2 cores; takes >230 min (wall clock) per model
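A rough back-of-envelope suggests why most of these combinations die (illustrative arithmetic, assuming a process-based joblib backend where each search worker ends up holding its own copy of the training data; the ~93 GB figure is the footprint observed during standalone training):

```python
# Back-of-envelope memory estimate (assumed worst case: one full copy of
# the training data per worker process under a process-based backend).
data_gb = 93    # approximate in-memory footprint of the data (assumption)
ram_gb = 190    # node RAM

for n_jobs in (2, 3, 12, 28):
    needed = data_gb * n_jobs
    status = "likely OOM" if needed > ram_gb else "may fit"
    print(f"n_jobs={n_jobs:>2}: ~{needed} GB -> {status}")
```

This matches the table: two workers (~186 GB) squeak by, while three or more exceed the 190 GB on the node.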
How can I get the performance I see when training a standalone RandomForestClassifier while running a hyperparameter search? And how does the standalone version parallelize such that it does not create copies of my large dataset, as the grid search does?
EDIT:
The following combination of parameters effectively used all cores to train each individual RandomForestClassifier, without parallelizing the hyperparameter search itself or blowing up RAM usage.
import joblib

model = sklearn.ensemble.RandomForestClassifier(n_jobs=-1, verbose=1)
search = sklearn.model_selection.RandomizedSearchCV(estimator=model,
                                                    param_distributions=random_grid,
                                                    n_iter=10,
                                                    cv=3,
                                                    verbose=10,
                                                    random_state=0,
                                                    n_jobs=1,
                                                    pre_dispatch=1)
with joblib.parallel_backend('threading'):
    search.fit(x, y)
If training a single classifier already saturates all your cores, there is nothing to gain by also parallelizing the grid search. Set n_jobs=1 for the grid search and keep n_jobs=-1 for the classifier. This should avoid the out-of-memory condition.
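A minimal, self-contained sketch of this configuration, using a small synthetic dataset and an illustrative parameter grid in place of the original data and random_grid:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for the original (much larger) dataset.
x, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Parallelize tree fitting inside each forest, not the search itself.
model = RandomForestClassifier(n_jobs=-1)
search = RandomizedSearchCV(estimator=model,
                            param_distributions={'max_depth': [5, 10, None],
                                                 'n_estimators': [50, 100]},
                            n_iter=4,
                            cv=3,
                            random_state=0,
                            n_jobs=1,          # search runs sequentially
                            pre_dispatch=1)

# The threading backend lets the estimator's workers share one copy of
# the data in memory instead of duplicating it across processes.
with joblib.parallel_backend('threading'):
    search.fit(x, y)

print(search.best_params_)
```

The key design choice is where the parallelism lives: trees within a forest release the GIL during fitting, so threads keep all cores busy while every candidate model reads the same in-memory arrays.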