Feature selection using backward feature selection in scikit-learn and PCA


I have calculated the scores of all the columns in my DataFrame, which has 312 columns and 650 rows, using PCA with the following code:

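import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# tt is the DataFrame with 650 rows and 312 columns
all_pca = PCA(random_state=4)
all_pca.fit(tt)
all_pca2 = all_pca.transform(tt)

# Cumulative explained variance (%) against the number of components kept
plt.plot(np.cumsum(all_pca.explained_variance_ratio_) * 100)
plt.xlabel('Number of components')
plt.grid(which='both', linestyle='--', linewidth=0.5)
plt.xticks(np.arange(0, 330, step=25))
plt.yticks(np.arange(0, 110, step=10))
plt.ylabel('Explained variance (%)')
plt.savefig('elbow_plot.png', dpi=1000)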

and the result is the following image.

[Figure: cumulative explained variance (%) against number of components]

My main goal is to use only the important features for random forest regression, gradient boosting, OLS regression, and LASSO. As you can see, 100 columns describe 95.2% of the variance in my Dataframe.

Can I use this threshold (100 Columns) for backward feature selection?


Solution

  • As you can see, 100 columns describe 95.2% of the variance in my Dataframe.

    The graph tells you that 100 PCA components capture 95% of the variance. These 100 components do not correspond to 100 individual features. Each PCA component is a linear combination of all 312 original features.
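
    To make the point about linear combinations concrete, here is a minimal sketch on synthetic data. The random matrix X is only a stand-in for the DataFrame in the question, and all variable names are illustrative. It shows that each component has one loading per original column, and that the transformed data is just the centred data projected onto those loadings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the question's data (650 rows, 312 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(650, 312))

pca = PCA(n_components=100, random_state=4).fit(X)

# One row of loadings per component, one column per original feature.
print(pca.components_.shape)  # (100, 312)

# Count the original features that carry a non-zero weight in the first component.
n_contributing = int(np.count_nonzero(pca.components_[0]))
print(n_contributing)

# The transformed data is the centred data projected onto the loadings.
scores = pca.transform(X)
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual))
```

    Because every component draws on essentially every column, keeping 100 components does not remove any of the 312 original features from the pipeline.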

    When 100 PCA components capture 95% of the variance, it means that your original 312 columns can be linearly combined into fewer (100) new columns while losing only about 5% of the variance. It's a measure of the intrinsic dimensionality of the feature set.
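
    The number of components needed for a given share of the variance can be read off programmatically instead of from the plot. The sketch below assumes synthetic data with 20 hidden factors (an arbitrary choice for illustration, not the question's data) and compares a manual count with scikit-learn's option of passing a fraction as n_components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 312 columns generated from 20 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(650, 20))
mixing = rng.normal(size=(20, 312))
X = latent @ mixing + 0.1 * rng.normal(size=(650, 312))

# Fit all components and accumulate the explained variance ratio.
pca = PCA(random_state=4).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k)

# Passing a float in (0, 1) asks PCA to pick that number itself.
pca95 = PCA(n_components=0.95).fit(X)
print(pca95.n_components_)
```

    On the real data, the same two lines applied to tt should reproduce the figure of roughly 100 components read from the elbow plot.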

    Can I use this threshold (100 Columns) for backward feature selection?

    The 100 PCA components that explain 95% don't really tell you which individual features (or how many of them) are important, as each PCA component is a mix of all features. Also, the 95% refers to variability of the features - it doesn't mean the 100 PCA components will be useful for the target.
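
    A small, deliberately contrived example shows why variance retained says nothing about usefulness for the target. The data here is synthetic and constructed for illustration: nine high-variance columns are unrelated to y, and one low-variance column drives it. Keeping the top nine components retains almost all of the feature variance but discards nearly all of the predictive signal.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 650

# Nine high-variance columns unrelated to the target,
# plus one low-variance column that determines it.
noise_features = rng.normal(scale=10.0, size=(n, 9))
signal_feature = rng.normal(scale=0.1, size=(n, 1))
X = np.hstack([noise_features, signal_feature])
y = 5.0 * signal_feature[:, 0] + rng.normal(scale=0.01, size=n)

# Keep the nine highest-variance components.
pca = PCA(n_components=9).fit(X)
retained = pca.explained_variance_ratio_.sum()
Z = pca.transform(X)

# In-sample R^2 using the PCA components versus the original columns.
r2_pca = LinearRegression().fit(Z, y).score(Z, y)
r2_raw = LinearRegression().fit(X, y).score(X, y)

print(f"variance retained: {retained:.4f}")
print(f"R^2 on 9 components: {r2_pca:.3f}")
print(f"R^2 on original columns: {r2_raw:.3f}")
```

    Real data is rarely this extreme, but the same mechanism applies whenever the target depends on directions of low variance.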

    Perhaps you could use the 100 components to guide your choice between forward and backward feature selection. In this case, the intrinsic dimensionality of the dataset is closer to 100 than it is to 312, so I'd opt for forward selection, as the number of useful features is probably well below the original 312.

    If you were to run PCA before feature selection, it would create new features (PCA components) out of the original ones, and in the process you could lose interpretability, as the new features can be messy linear combinations of the original ones.

    One method of identifying useful features is to run forward (or backward) selection on a random forest using the original features, and stop when you hit a validation score threshold (for the regression models here, a target R² rather than accuracy). Then you can use those selected features for other models.

    Feature selection is relatively time-consuming as it requires lots of repeated model fitting. Permutation importance is another way of identifying key features.
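
    For comparison, permutation importance needs only one fitted model. The sketch below uses scikit-learn's permutation_importance on a held-out split of synthetic data. The data, split size, and number of repeats are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only columns 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each column of the validation set in turn and record
# how much the model's score drops.
result = permutation_importance(
    forest, X_val, y_val, n_repeats=10, random_state=0
)

# Column indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(idx, round(float(result.importances_mean[idx]), 3))
```

    Strongly correlated columns can share importance between them, so a low score for one column in a correlated group does not necessarily mean the group is unimportant.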