Tags: python, scikit-learn, random-forest, overfitting-underfitting

Why doesn't the random forest max_depth parameter's validation score shrink when overfitting occurs?


I made a random forest model and visualized the results.

# training code
import numpy as np

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits = load_digits()
forest_param = {'max_depth': np.arange(1, 15),
                'n_estimators': [50, 100, 150, 200, 250, 300, 350, 400]}
forest_classifier = RandomForestClassifier()

forest_grid = GridSearchCV(forest_classifier, forest_param, n_jobs=-1, return_train_score=True, cv=10)
digit_data = digits.data
digit_target = digits.target
forest_grid.fit(digit_data, digit_target)

print("best forest validation score")
print(forest_grid.best_score_)
# visualize code
import matplotlib.pyplot as plt

def plot_search_results(grid, lsi_log_index):
    """
    Params:
        grid: A trained GridSearchCV object.
        lsi_log_index: Indexes of subplots whose x-axis should use a log scale.
    """
    ## Results from grid search
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']
    means_train = results['mean_train_score']
    stds_train = results['std_train_score']

    ## Getting indexes of values per hyper-parameter
    masks=[]
    masks_names= list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_'+p_k].data==p_v))

    params=grid.param_grid

    ## Plotting results
    fig, ax = plt.subplots(1,len(params),sharex='none', sharey='all',figsize=(20,5))
    fig.suptitle('Score per parameter')
    fig.text(0.04, 0.5, 'MEAN SCORE', va='center', rotation='vertical')
    for i, p in enumerate(masks_names):
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        y_2 = np.array(means_train[best_index])
        e_2 = np.array(stds_train[best_index])
        ax[i].errorbar(x, y_1, e_1, linestyle='--', marker='o', label='test')
        ax[i].errorbar(x, y_2, e_2, linestyle='-', marker='^',label='train' )
        ax[i].set_xlabel(p.upper())
    for log_scaler in lsi_log_index:
        ax[log_scaler].set_xscale("log")

    plt.legend()
    plt.show()
  
plot_search_results(forest_grid, [])

I want the validation score to shrink when overfitting occurs,
like it does for the SVR C parameter. Image1
There, the validation score shrinks when overfitting occurs.

But the max_depth parameter's validation score does not shrink when overfitting should occur. Image2

I've learned that the validation score shrinks when an overfitting situation occurs.
Can you tell me why this happens? :)


Solution

  • Well, it all depends on the dataset. From your Image2, we can see that for your RandomForestClassifier, increasing max_depth is not overfitting your training set. By default, the trees are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples (2 by default). These conditions mean your trees stop growing before they reach the maximum allowed depth, so raising max_depth beyond that point changes nothing and your model does not overfit further.

    On the other hand, with SVR a large C parameter will force the model to fit every training sample as closely as possible. Hence the model is overfitting.
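    One way to check this claim yourself (a quick sketch, not part of the original answer) is to inspect the actual depths of the fitted trees via each estimator's `get_depth()`: past a certain cap, the trees stop growing on their own, so a larger `max_depth` no longer changes the fitted model.

    ```python
    # Sketch: show that the effective depth of the fitted trees plateaus,
    # so large max_depth caps stop having any effect on the digits dataset.
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_digits(return_X_y=True)

    for cap in [5, 10, 50, None]:
        forest = RandomForestClassifier(max_depth=cap, n_estimators=50,
                                        random_state=0)
        forest.fit(X, y)
        # Actual depth reached by each fitted tree in the ensemble
        depths = [tree.get_depth() for tree in forest.estimators_]
        print(f"max_depth={cap}: actual depths "
              f"min={min(depths)}, max={max(depths)}")
    ```

    With small caps (5, 10) the trees are truncated at exactly the cap, while with `max_depth=50` or `None` they stop at roughly the same modest depth on their own, which is why the validation curve flattens instead of dropping.
    
    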