
How to find most contributing features to PCA?


I am running PCA on my data (~250 features) and see that all points are clustered in 3 blobs.

Is it possible to see which of the 250 features have been most contributing to the outcome? if so how?

(using the Scikit-learn implementation)


Solution

  • Let's see what wikipedia says:

    PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

    To see how 'influential' the original-space basis vectors are in the smaller space, you have to project them as well, which is done by:

    res = pca.transform(np.eye(D))
    
    • np.eye(n) creates an n x n identity matrix (ones on the diagonal, zeros elsewhere).
    • Thus, np.eye(D) is the set of basis vectors of your original feature space.
    • res is the projection of those features into the lower-dimensional space.

    The interesting thing is that res is a D x d matrix, where res[i][j] represents "how much feature i contributes to component j".

    Then, you may sum the absolute values over the columns to get a D x 1 vector (call it contribution), where each contribution[i] is the total contribution of feature i. (Taking absolute values keeps positive and negative loadings from cancelling each other out.)

    Sort it and you find the most contributing features :)
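    Putting the steps above together, here is a minimal runnable sketch on toy data (5 features standing in for your ~250; the data and dimensions are made up for illustration). One caveat worth noting: PCA.transform centers the input by subtracting the fitted mean, so to recover the pure loadings you can subtract the projection of the origin — the result then equals pca.components_.T exactly.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    # Toy data: 100 samples, 5 features; feature 0 is given a much larger
    # variance so it should dominate the first principal component.
    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)
    X[:, 0] *= 100

    D = X.shape[1]
    pca = PCA(n_components=2).fit(X)

    # Project the original basis vectors (rows of the identity matrix).
    # Subtracting the projected origin cancels the mean-centering that
    # transform() applies, leaving the raw loadings (= pca.components_.T).
    res = pca.transform(np.eye(D)) - pca.transform(np.zeros((1, D)))  # shape (D, d)

    # Total contribution of each feature: sum of absolute loadings per row.
    contribution = np.abs(res).sum(axis=1)  # shape (D,)

    # Features ranked from most to least contributing.
    ranking = np.argsort(contribution)[::-1]
    print(ranking)
    ```

    With ~250 features you would simply read off the first few indices of ranking to see which original features drive the projection the most.
    
    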

    Not sure if that's clear; I can add more information if needed.

    Hope this helps, pltrdy