I'm running a relatively large job: a randomized grid search on a dataset, which already takes a long time even with a small n_iter_search.
I'm running it on a 64-core machine, and for about 2 hours it kept 2000 threads active working on the first folds. It then stopped reporting to stdout completely. Its last report was:
[Parallel(n_jobs=-1)]: Done 4 out of 60 | elapsed: 84.7min remaining: 1185.8min
I've noticed in htop that almost all cores are at 0%, which would not happen while random forests are training. There is no feedback and there are no errors from the program; if it weren't for htop, I would assume it is still training. This has happened before, so it is a recurring problem. The machine is perfectly responsive and the process seems alive.
I already have verbose = 10. Any thoughts on how I can diagnose what is going on inside the RandomizedSearchCV?
The grid search I'm doing:
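from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search before scikit-learn 0.18
rfc = RandomForestClassifier(n_jobs=-1)
param_grid = {'n_estimators': sp_randint(100, 5000), 'max_features': ['auto', None], 'min_samples_split': sp_randint(2, 6)}
n_iter_search = 20
CV_rfc = RandomizedSearchCV(estimator=rfc, param_distributions=param_grid, n_iter=n_iter_search, verbose=10, n_jobs=-1)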
As a first step, adding the verbose parameter to the RandomForestClassifier as well could let you see if the search is really stuck. It will display progress in fitting the trees (building tree 88 out of 100 ...).
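A minimal sketch of that first step, assuming a small synthetic dataset from make_classification in place of the real data and a deliberately small forest so it runs quickly. The values for n_estimators, n_jobs and random_state are illustrative choices, not the ones from the question:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the real dataset (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# verbose=2 makes the forest report each tree as it is built;
# n_jobs=1 keeps the progress lines in order for this small demo.
rfc = RandomForestClassifier(n_estimators=10, verbose=2, n_jobs=1, random_state=0)
rfc.fit(X, y)
```

If the per-tree messages stop appearing while the process is still alive, that would suggest the hang is inside a single forest fit rather than between search iterations.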
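I don't really know why your search got stuck, but thinking about it, removing the search on n_estimators should enable you to grid search the entire space of the remaining parameters in just 8 iterations: max_features has 2 candidate values and sp_randint(2, 6) yields 4 values for min_samples_split (2 to 5, the upper bound being exclusive), so 2 x 4 = 8 combinations.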
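The count of 8 can be checked with a short sketch. It assumes n_estimators is fixed at an arbitrary small value, uses synthetic data in place of the real dataset, and writes 'sqrt' where the question has 'auto' (for classifiers 'auto' meant the same thing, and recent scikit-learn releases no longer accept it). The cv=3 setting is also just an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# The discrete space left once n_estimators is no longer searched.
# sp_randint(2, 6) draws from 2, 3, 4, 5 (upper bound exclusive).
param_grid = {
    "max_features": ["sqrt", None],
    "min_samples_split": [2, 3, 4, 5],
}
combos = list(ParameterGrid(param_grid))
print(len(combos))  # 2 * 4 combinations

# Toy data and a small fixed forest size (both assumptions for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rfc = RandomForestClassifier(n_estimators=20, random_state=0)

# An exhaustive search covers every combination exactly once.
search = GridSearchCV(rfc, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

With the space this small, an exhaustive GridSearchCV visits every setting once, whereas a randomized search with n_iter = 20 would spend its draws on only 8 distinct combinations of these two parameters.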