
How to apply PCA on a dataset and print the relevant features


I have a dataset with 23 rows and 48 columns. I am applying PCA to reduce the number of column dimensions. I use the following code examples and see that only 23 components are retained:

# first: plot the cumulative explained variance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(only_features)
plt.figure(figsize=(15, 8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

# second: transform the data and inspect its shape
import pandas as pd

df_pca = pca.fit_transform(X=only_features)
df_pca = pd.DataFrame(df_pca)
print(df_pca.shape)

However, I would like to know which features are required. For example: if the original dataset had columns A–Z and was reduced by PCA, I would want to know which features were selected.

How can I do that?

Thanks for the help.


Solution

  • Credit to this answer1 & answer2: Sklearn's documentation states that when you don't specify the n_components parameter, the number of components retained is min(n_samples, n_features). Since min(23, 48) = 23, that's why you got 23 components in your case.
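
    A minimal sketch of this behavior, using random data with the question's shape (23 × 48) as a stand-in for your real dataset:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    # Illustrative data with the question's shape: 23 samples, 48 features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(23, 48))

    # With n_components left unspecified, sklearn keeps
    # min(n_samples, n_features) components
    pca = PCA().fit(X)
    print(pca.n_components_)  # 23
    ```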

    Solution 1: if you use the Sklearn library (credit to this answer)

    • check the variance explained by each PC: pca.explained_variance_ratio_
    • check the importance of features in each PC: print(abs(pca.components_))
    • use a customized function to extract more info about the PCs; see this answer.
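
    Putting those pieces together, here is a hedged sketch of how to map each PC back to the original column names via its absolute loadings. The column names f0…f47 are placeholders for your real feature names:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Illustrative data; f0..f47 stand in for your real column names
    rng = np.random.default_rng(0)
    only_features = pd.DataFrame(
        rng.normal(size=(23, 48)),
        columns=[f"f{i}" for i in range(48)],
    )

    pca = PCA().fit(only_features)

    # Rows = PCs, columns = original features, values = absolute loadings
    loadings = pd.DataFrame(
        np.abs(pca.components_),
        columns=only_features.columns,
        index=[f"PC{i + 1}" for i in range(pca.n_components_)],
    )

    # For each PC, the original feature with the largest absolute loading
    top_feature_per_pc = loadings.idxmax(axis=1)
    print(top_feature_per_pc.head())
    ```

    Note that PCA does not literally select columns — each PC is a linear combination of all features — so this gives you the feature that contributes most to each component, not a subset of kept columns.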

    Solution 2: if you use the pca library (documentation)

    from pca import pca

    # Initialize the model
    model = pca()
    # Fit and transform the data
    out = model.fit_transform(X)
    
    # Print the top features. The results show that f1 is best, followed by f2 etc
    print(out['topfeat'])
    
    #     PC      feature
    # 0  PC1      f1
    # 1  PC2      f2
    # 2  PC3      f3
    # 3  PC4      f4
    # 4  PC5      f5
    ...
    

    You can even plot the PCs with: model.plot()
