Tags: python, scikit-learn, random-forest, auc, hyperparameters

AUC of Random Forest model is lower after tuning hyperparameters using grid search and 10-fold CV


The AUC value I got without tuning the hyperparameters was higher. I used the same training data, so is there something I am missing here, or is there a valid explanation?

The data is the average of the word embeddings of each tweet, computed using pretrained 50-dimensional GloVe vectors for tweets.
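(For reference, the feature construction presumably looks something like the sketch below; `tweet_embedding` and `glove` are hypothetical names, not from the original code. `glove` is assumed to be a token-to-vector dict loaded from the pretrained 50-d GloVe Twitter file.)

import numpy as np

# Hypothetical sketch of the feature construction described above:
# each tweet becomes the mean of the 50-d GloVe vectors of its tokens.
def tweet_embedding(tweet, glove, dim=50):
    vectors = [glove[tok] for tok in tweet.lower().split() if tok in glove]
    if not vectors:
        return np.zeros(dim)          # all tokens out of vocabulary
    return np.mean(vectors, axis=0)   # average token vectors -> one 50-d row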

Without tuning:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

AUC: 0.978
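(Presumably the baseline AUC was computed along these lines; `baseline_auc` and the train/test split variables are hypothetical, not from the post.)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical reconstruction of the untuned baseline evaluation.
def baseline_auc(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X_train, y_train)
    # roc_auc_score expects probabilities/scores for the positive class
    return roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])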

With tuning:

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=3,
       param_grid={'max_features': ['auto', 'sqrt', 'log2', None], 'bootstrap': [True, False], 'max_depth': [2, 3, 4], 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

print(cv_rf.best_estimator_)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

AUC: 0.883
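(The tuned AUC was presumably computed the same way on the refit best estimator; `X_test` and `y_test` are assumed to be the same held-out data as above.)

from sklearn.metrics import roc_auc_score

# With refit=True, cv_rf.predict_proba delegates to best_estimator_,
# which has been retrained on the full training set.
auc_tuned = roc_auc_score(y_test, cv_rf.predict_proba(X_test)[:, 1])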


Solution

  • I expect two possible reasons for this.

    1. max_depth is set to None in the former model, which means nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples, whereas max_depth=4 in the latter, which makes the model less flexible.

    Suggestion: increase the max_depth range in the grid search.

    2. The number of estimators (n_estimators) is reduced from 100 to 10, which makes the ensemble weaker. (The n_estimators='warn' in the GridSearchCV output is the old scikit-learn sentinel for the then-default of 10 trees, so the search never considered the 100 trees used in the untuned model.)

    Suggestion: increase the number of estimators, or tune n_estimators as well; see the sketch after this list.
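A minimal sketch of both suggestions, assuming the same cv_rf setup and training data as in the question (the exact grid values here are illustrative, not prescriptive):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Widen max_depth (including None, i.e. fully grown trees) and tune
# n_estimators alongside the original parameters.
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [2, 4, 8, 16, None],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy'],
}
cv_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=10,
    scoring='roc_auc',   # optimize the metric actually being reported
    n_jobs=3,
)
# cv_rf.fit(X_train, y_train)  # X_train/y_train assumed as before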