Search code examples
pythonregressionfeature-selection

Why does R2-value increase after feature-reduction with RFE?


For an exploratory semester project, I am trying to predict the outcome value of a quality control measurement using various measurements made during production. For the project I was testing different algorithms (LinearRegression, RandomForestRegressor, GradientBoostingRegressor, ...). I generally get rather low r2-values (around 0.3), which is probably due to the scattering of the feature values and not my real problem here.
Initially, I have around 100 features, which I am trying to reduce using RFE with LinearRegression() as estimator. Cross validation indicates, I should reduce my features to only 60 features. However, when I do so, for some models the R2-value increases. How is that possible? I was under the impression that adding variables to the model always increases R2 and thus reducing the number of variables should lead to lower R2 values.
Can anyone comment on this or provide an explanation?

Thanks in advance.


Solution

  • It depends on whether you are using the testing or training data to measure R2. This is a measure of how much of the variance of the data your model captures. So, if you increase the number of predictors then you are correct in that you do a better job predicting exactly where the training data lie and thus your R2 should increase (converse is true for decreasing the number of predictors).

    However, if you increase number of predictors too much you can overfit to the training data. This means the variance of the model is actually artificially high and thus your predictions on the test set will begin to suffer. Therefore, by reducing the number of predictors you actually might do a better job of predicting the test set data and thus your R2 should increase.