This is going to be a long post; I have many questions. I am working with a data set of shape (15000, 5) plus a dependent variable with 3 classes (5000 data points per class). I use StratifiedShuffleSplit to divide the data into a train set (12000,) and a test set (3000,), and then run a random forest classifier through RandomizedSearchCV.
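
For reference, the split is done roughly like this (a minimal sketch; X and y are assumed to be NumPy arrays holding the features and the labels):

from sklearn.model_selection import StratifiedShuffleSplit

# X has shape (15000, 5); y has shape (15000,) with 3 balanced classes
sss = StratifiedShuffleSplit(n_splits=1, test_size=3000, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]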
Error message with oob_score
I run the random forest with oob_score=True, but I get the warning: "UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates." I have increased n_estimators, but nothing works. Do I just ignore it, or what can I do?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Use the random grid to search for the best hyperparameters
param_grid = {'n_estimators': range(1, 1000, 1),
              'min_samples_split': [2, 5, 10],
              'max_depth': range(1, 500, 5),
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True]}

RF = RandomForestClassifier(oob_score=True,
                            random_state=42,
                            warm_start=True,
                            n_jobs=-1)

# Random search of parameters, using 3-fold cross-validation,
# searching across 100 different combinations
rf_random = RandomizedSearchCV(estimator=RF, param_distributions=param_grid,
                               n_iter=100, cv=3, verbose=2)

# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
rf_random.best_score_, rf_random.best_estimator_.oob_score_
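
After the search, I evaluate the best estimator on the held-out test set along these lines (a sketch):

from sklearn.metrics import accuracy_score

# Held-out test accuracy of the tuned forest, for comparison with other models
y_pred = rf_random.best_estimator_.predict(X_test)
print(accuracy_score(y_test, y_pred))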
Interpreting and using oob_score
I get a best_score_ of 0.89 and a best_estimator_.oob_score_ of 0.90. Does the oob_score_ of 0.90 mean 90% correct classification, or is the accuracy (1 - best_estimator_.oob_score_), which would be really bad?
Also, since I am comparing this model to other ML algorithms, I am guessing that accuracy is the best metric to keep the comparison consistent, right?
How can I use the oob_score to test for overfitting and underfitting?
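
The check I had in mind is something like the following sketch (assuming oob_score_ is an accuracy), comparing the training accuracy with the OOB estimate:

# A training accuracy far above the OOB estimate would suggest overfitting;
# both being low would suggest underfitting
best_rf = rf_random.best_estimator_
print(best_rf.score(X_train, y_train))  # training accuracy
print(best_rf.oob_score_)               # OOB accuracy estimate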
Since n_estimators is sampled from range(1, 1000), it is possible to get just a few trees in the forest, which is contrary to the idea of using random forests. You might consider adjusting the range to one with a higher lower bound, e.g. range(50, 1000, 1); that should also get rid of the UserWarning.
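
A minimal sketch of that adjustment, keeping everything else from the question's grid unchanged:

# Start n_estimators at 50 so every candidate drawn by the search
# has enough trees for reliable OOB estimates
param_grid = {'n_estimators': range(50, 1000, 1),
              'min_samples_split': [2, 5, 10],
              'max_depth': range(1, 500, 5),
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True]}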