python, scikit-learn, random-forest, training-data, feature-selection

Error bars in feature selection increase when using more features?


I am following this example to determine feature importance using Random Forests. When I use all available features versus only a subset of them, these are the results I observe, respectively:

[Plot: feature importances with error bars, all features]

[Plot: feature importances with error bars, subset of features]

Is there a particular reason the error bars increase drastically when using all possible features? Is there any significance to a negative quantity? (Note: the particular labels on the x-axis in the two plots do not necessarily correspond.)
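For context, the scikit-learn example being followed computes impurity-based importances and draws the standard deviation across the individual trees as error bars. The sketch below reproduces that setup on synthetic stand-in data; the dataset and forest parameters are illustrative assumptions, not the actual data behind the plots above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: only a few of the features carry real signal
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0, shuffle=False)

forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
# The error bars in the plots are the standard deviation of each
# feature's importance across the individual trees of the ensemble
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f} +/- {std[i]:.3f}")
```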


Solution

  • When you use only the most important features, there is less chance of an error (i.e., less chance of the model incorrectly learning a pattern where it shouldn't).

    Without using feature importances

    • There is a high chance that your model is capturing patterns where it shouldn't, and hence assigning importance to less important features.
    • Also, a Random Forest is an ensemble of decision trees; some trees may capture the correct feature importances, and some may not.
    • The most important features have such large error bars because in some trees they may be ignored altogether or given the least importance, while other trees capture them correctly.
    • Hence, you have both ends of the spectrum, resulting in such large error bars.

    Using feature importances

    • You eliminate the least important features successively, so in subsequent fits those features are not considered at all (hence less room for variation in the importance estimates).
    • Doing this successively improves the chances that the more important features are selected again and again for splitting, so the error margin is comparatively small; the sketch below illustrates this.
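A minimal sketch of this argument (the synthetic data and the choice of k below are assumptions, not the asker's actual setup): with all features present, the individual trees disagree about which features matter, and once only the top features are kept, the trees largely agree and the spread shrinks.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic setup as the sketch in the question above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0, shuffle=False)
forest = RandomForestClassifier(n_estimators=250, random_state=0).fit(X, y)

# Per-tree importances: each row is one tree's view of every feature,
# and the spread across rows is what the error bars measure
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
print("std across trees (all features):", per_tree.std(axis=0))

# Keep only the k most important features and refit (k is an arbitrary choice here)
k = 3
top = np.argsort(forest.feature_importances_)[::-1][:k]
forest_subset = RandomForestClassifier(n_estimators=250, random_state=0).fit(X[:, top], y)

per_tree_subset = np.array([t.feature_importances_ for t in forest_subset.estimators_])
# With the uninformative features removed, the trees agree far more often,
# so the standard deviation (the plotted error bar) is typically smaller
print("std across trees (top-k features):", per_tree_subset.std(axis=0))
```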