
How to tune Sklearn's RandomForest? max_depth vs. min_samples_leaf


max_depth vs. min_samples_leaf

The parameters max_depth and min_samples_leaf have confused me the most across multiple attempts at using GridSearchCV. To my understanding, both of these parameters are ways of controlling the depth of the trees; please correct me if I'm wrong.

max_features

I'm doing a very simple classification task, and changing min_samples_leaf seems to have no effect on the AUC score; however, tuning the depth improves my AUC from 0.79 to 0.84, which is pretty drastic. Nothing else seems to affect it either. I thought the main thing I should tune is max_features; however, the best value found is not far off from sqrt(n_features).

scoring='roc_auc'

Another issue: I noticed that if all the other parameters are fixed while changing the number of trees, GridSearchCV will always select the highest number of trees. This is understandable, but the AUC slightly drops for some reason, even though scoring='roc_auc'. Why is this happening? Does it consider the oob_score instead?
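
For reference, here is a minimal sketch of the kind of search described above. The dataset, train/test split, and grid values are assumptions for illustration, not the actual setup from the question:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Synthetic stand-in for the real classification data (assumption).
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Example grid covering the parameters discussed in the question.
    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_depth": [None, 5, 10, 20],
        "min_samples_leaf": [1, 5, 20],
        "max_features": ["sqrt", 0.5, None],
    }

    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="roc_auc",  # model selection uses cross-validated AUC, not oob_score
        cv=5,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)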

Please feel free to share any resource that can help in understanding how random forests can be tuned systematically, as it seems there are a few related parameters affecting each other.


Solution

  • As you increase max_depth you increase variance and decrease bias. On the other hand, as you increase min_samples_leaf you decrease variance and increase bias.

    So, these parameters control the level of regularization when growing the trees. In summary, decreasing any of the max_* parameters and increasing any of the min_* parameters will increase regularization (see the first sketch after this answer).

    Secondly, it's hard to say why your accuracy is dropping. You might want to try nested CV to get a sense of the range of scores that best_params_ exhibits when generalizing to unseen data (see the second, nested-CV sketch below).
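
A small sketch of the regularization effect described above, using synthetic data (an assumption): an unconstrained forest fits the training set almost perfectly (low bias, high variance), while a shallow max_depth and a large min_samples_leaf pull the train and test AUC closer together:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Noisy synthetic data so overfitting is visible (assumption).
    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    settings = [
        ("unconstrained", {}),                                        # low bias, high variance
        ("regularized", {"max_depth": 5, "min_samples_leaf": 20}),    # higher bias, lower variance
    ]
    for name, params in settings:
        rf = RandomForestClassifier(n_estimators=300, random_state=0, **params)
        rf.fit(X_train, y_train)
        train_auc = roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
        print(f"{name}: train AUC={train_auc:.3f}, test AUC={test_auc:.3f}")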
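
And a minimal nested-CV sketch, again with assumed data and a reduced grid: the inner GridSearchCV tunes the parameters, while the outer cross_val_score reports how the tuned model scores on folds it never saw during tuning, which gives a fairer picture than best_score_ alone:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Inner loop: hyperparameter tuning on each outer training fold.
    inner_search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"max_depth": [None, 5, 10], "min_samples_leaf": [1, 5, 20]},
        scoring="roc_auc",
        cv=3,
    )

    # Outer loop: unbiased estimate of the tuned model's AUC on unseen folds.
    outer_scores = cross_val_score(inner_search, X, y, scoring="roc_auc", cv=5)
    print(outer_scores.mean(), outer_scores.std())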