
PCA recover most important features in a dataframe


I am trying to work out how to use PCA to determine the most important features. I think I have done that below.

What I am wondering is: how do I pass the most important features, with their original column names (from a pandas dataframe), back into the new dataframe I am creating at the bottom, so I can use that as the new 'lightweight' dataset?

This way, if I set n_components to 10, I would have 10 named feature columns being passed into the new dataframe.

Any ideas?

from sklearn.decomposition import PCA
import pandas as pd

# PCA (principal component analysis) aims to reduce the number of dimensions
# in the dataset without losing the information that is most relevant to the model.
# It also provides a score (explained variance) per component, so you can drop
# components with poor scores.
X_pc = PCA(n_components=2).fit_transform(train_features)
pd.DataFrame({'PC1': X_pc[:, 0], 'PC2': X_pc[:, 1], 'Y': train_labels.ravel()}).sample(10)

Solution

  • PCA reduced the dimensions to 2 by linearly combining the initial features. After the transformation, the output is a matrix of size [n_samples, n_components], so it is not possible to attach the original column names to it: each component is a mixture of all the features, and you cannot project the names back.

    The important features are the ones that influence the components the most, i.e. the ones with a large absolute value (loading) on the component.

    If you adapt the code, you can get the most important feature on each of the PCs:

    from sklearn.decomposition import PCA
    import pandas as pd
    import numpy as np
    np.random.seed(0)
    
    # 10 samples with 5 features
    train_features = np.random.rand(10,5)
    
    model = PCA(n_components=2).fit(train_features)
    X_pc = model.transform(train_features)
    
    # number of components
    n_pcs= model.components_.shape[0]
    
    # get the index of the most important feature on EACH component
    most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
    
    initial_feature_names = ['a','b','c','d','e']
    
    # get the names
    most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
    
    # map each PC to the name of its most important feature (dict comprehension)
    dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}
    
    # build the dataframe
    df = pd.DataFrame(sorted(dic.items()))
    

    This prints:

         0  1
     0  PC1  e
     1  PC2  d
    

    So on PC1 the most important feature is the one named e, and on PC2 it is d.
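    To get something closer to the 'lightweight' dataset the question asks for, a different option is to keep only the original columns that dominate each component, rather than the transformed scores. A minimal sketch, reusing the toy data from above (`keep` and `light_df` are names made up here):

```python
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd

np.random.seed(0)

# same toy data as above: 10 samples, 5 named features
initial_feature_names = ['a', 'b', 'c', 'd', 'e']
train_df = pd.DataFrame(np.random.rand(10, 5), columns=initial_feature_names)

model = PCA(n_components=2).fit(train_df.values)

# index of the most important feature on each component
most_important = [np.abs(component).argmax() for component in model.components_]

# keep only those original columns (drop duplicates, preserve order)
keep = list(dict.fromkeys(initial_feature_names[i] for i in most_important))
light_df = train_df[keep]
```

    With the seed and data above this keeps columns e and d, and `light_df` still holds the original (untransformed) values, so the column names remain meaningful.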

    Further reading: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f