random-forest

oob_score questions: from error messages to how to actually use it


This is going to be a long post; I have several questions. I am using a data set of shape (15000, 5) plus the dependent variable, which has 3 classes (5000 data points per class). I use StratifiedShuffleSplit to divide the data into train (12000,) and test (3000,) sets. I run a random forest classifier through RandomizedSearchCV.
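For context, the split step looks roughly like this (a minimal sketch; X and y are placeholder names for my features and labels, assumed to be NumPy arrays):

    from sklearn.model_selection import StratifiedShuffleSplit

    # X has shape (15000, 5); y has 3 balanced classes (5000 points each)
    sss = StratifiedShuffleSplit(n_splits=1, test_size=3000, random_state=42)
    train_idx, test_idx = next(sss.split(X, y))
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]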

Error message with oob_score

I run the random forest with oob_score=True, but I get the error: "UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates." I have increased n_estimators, but nothing works. Do I just ignore it, or what can I do?

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Use the random grid to search for best hyperparameters
    param_grid = {'n_estimators': range(1, 1000, 1),
                  'min_samples_split': [2, 5, 10],
                  'max_depth': range(1, 500, 5),
                  'min_samples_leaf': [1, 2, 4],
                  'bootstrap': [True]}
    RF = RandomForestClassifier(oob_score=True,
                                random_state=42,
                                warm_start=True,
                                n_jobs=-1)

    # Random search of parameters, using 3-fold cross validation,
    # search across 100 different combinations, and use all available cores
    rf_random = RandomizedSearchCV(estimator=RF, param_distributions=param_grid, n_iter=100, cv=3, verbose=2)
    # Fit the random search model
    rf_random.fit(X_train, y_train)
    rf_random.best_params_
    rf_random.best_score_, rf_random.best_estimator_.oob_score_

Interpreting and using oob_score

I get a best_score_ of 0.89 and a best_estimator_.oob_score_ of 0.90. Does the oob_score_ mean 90% correct classification, or is it (1 - best_estimator_.oob_score_), which would mean the model is really bad?

Also, since I am comparing this model to other ML algorithms, I am guessing that accuracy is the best way to stay consistent, right?

How can I use the oob_score to test for overfitting and underfitting?


Solution

    • We use random forests to get more robust models compared to a single decision tree. With a range of (1, 1000), it is possible to get just a few trees in the forest, which is contrary to the idea of using random forests. Consider a range with a higher lower bound, e.g. range(50, 1000, 1) (see the first sketch below).
    • A higher number of trees in the forest might indeed resolve the UserWarning.
    • 0.9 is the correct-classification score (not 1 - 0.9): oob_score_ is the accuracy of the model on its out-of-bag samples, so 0.90 means 90% of those samples were classified correctly, which seems to be quite good.
    • The relatively small difference between the OOB score and the CV score indicates your model is most likely not overfitting (see the second sketch below).
    • Regarding the metric, please see Comprehensive Guide to Multiclass Classification Metrics for metrics which are more suitable than accuracy for multiclass classification (the third sketch below shows a couple of them).
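For the first point, a minimal sketch of a search space with a higher lower bound on n_estimators (assuming scipy is available; RandomizedSearchCV accepts scipy distributions directly and samples from them):

    from scipy.stats import randint

    # Sample n_estimators uniformly from [50, 1000) instead of starting at 1
    param_grid = {'n_estimators': randint(50, 1000),
                  'min_samples_split': [2, 5, 10],
                  'max_depth': randint(1, 500),
                  'min_samples_leaf': [1, 2, 4],
                  'bootstrap': [True]}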
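To use oob_score_ as an over/underfitting check, one approach (a sketch reusing the variables from your question) is to compare the training accuracy against the OOB and CV accuracies:

    best_rf = rf_random.best_estimator_

    train_acc = best_rf.score(X_train, y_train)  # tends to be near 1.0 for a forest
    oob_acc = best_rf.oob_score_                 # accuracy on out-of-bag samples
    cv_acc = rf_random.best_score_               # mean 3-fold CV accuracy

    # A large gap between train_acc and oob_acc/cv_acc suggests overfitting;
    # low values across all three suggest underfitting.
    print(f"train: {train_acc:.3f}  oob: {oob_acc:.3f}  cv: {cv_acc:.3f}")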
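For multiclass metrics beyond accuracy, scikit-learn's classification_report and a macro-averaged F1 score are common choices (a sketch using the test split from your question):

    from sklearn.metrics import classification_report, f1_score

    y_pred = rf_random.best_estimator_.predict(X_test)
    print(classification_report(y_test, y_pred))      # per-class precision, recall, F1
    print(f1_score(y_test, y_pred, average='macro'))  # single macro-averaged F1 score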