Tags: python, scikit-learn, pca

How to get contributions and squared cosines in sklearn PCA?


Working primarily from this paper, I want to implement the various PCA interpretation metrics it mentions - for example, the squared cosine and what the article calls the contribution.

However, the nomenclature here seems very confusing; in particular, it's not clear to me what exactly sklearn's pca.components_ is. I've seen some answers here and in various blogs stating that these are the loadings, while others state that they are the component scores (which I assume is the same thing as factor scores).

The paper defines contribution (of observation to component) as:

$$\operatorname{ctr}_{i,\ell} = \frac{f_{i,\ell}^{2}}{\lambda_\ell},$$

where $f_{i,\ell}$ is the factor score of observation $i$ on component $\ell$ and $\lambda_\ell$ is the eigenvalue of component $\ell$,

and it states that the contributions of all observations to a given component must sum to 1. This does not hold if one assumes that pca.explained_variance_ holds the eigenvalues and pca.components_ holds the factor scores:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame(data = [
    [0.273688, 0.42720, 0.65267],
    [0.068685, 0.008483, 0.042226],
    [0.137368, 0.025278, 0.063490],
    [0.067731, 0.020691, 0.027731],
    [0.067731, 0.020691, 0.027731]
], columns = ["MeS", "EtS", "PrS"])

pca = PCA(n_components=2)
X = pca.fit_transform(df)

# Assuming pca.components_ are the factor scores
# and pca.explained_variance_ the eigenvalues:
ctr = (pd.DataFrame(pca.components_.T**2)).div(pca.explained_variance_)
np.sum(ctr, axis=0)
# Yields 0.498437 and 0.725048 instead of 1

How can I calculate these metrics? The paper defines the squared cosine similarly as:

$$\cos^{2}_{i,\ell} = \frac{f_{i,\ell}^{2}}{d_{i,g}^{2}},$$

where $d_{i,g}^{2} = \sum_{\ell} f_{i,\ell}^{2}$ is the squared distance of observation $i$ from the center of gravity.


Solution

  • This paper does not play well with sklearn as far as definitions are concerned.

    pca.components_ holds the two principal axes (directions) of your data after centering, and pca.fit_transform(df) gives you the coordinates of the centered data with respect to those axes, i.e., the factor scores.

    > pca.fit_transform(df)
    array([[ 0.60781787, -0.00280834],
           [-0.1601333 , -0.01246807],
           [-0.11667497,  0.04584743],
           [-0.1655048 , -0.01528551],
           [-0.1655048 , -0.01528551]])
    
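    As a quick sanity check of this relationship (a minimal sketch; df and pca as fitted above), the factor scores are just the centered data projected onto the principal axes:

    > np.allclose(pca.fit_transform(df), (df - df.mean()) @ pca.components_.T)
    True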

    Next, the $\lambda_\ell$ of equation (10) in the paper is just the sum of the squared factor scores for the $\ell$-th component, i.e., the $\ell$-th column of pca.fit_transform(df). pca.explained_variance_, on the other hand, gives you the two variances, and since sklearn uses len(df.index) - 1 as the degrees of freedom, we have lambda_l == (len(df.index) - 1) * pca.explained_variance_[l].

    > X = pca.fit_transform(df)
    > lmbda = np.sum(X**2, axis = 0)
    > lmbda
    array([0.46348196, 0.00273262])
    
    > (5-1) * pca.explained_variance_
    array([0.46348196, 0.00273262])
    
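    This identity also pinpoints the bug in the question's snippet: it is the squared factor scores X**2, not pca.components_.T**2, that should be divided by the eigenvalues. A minimal check, recovering the eigenvalues from pca.explained_variance_:

    > ctr = X**2 / ((len(df.index) - 1) * pca.explained_variance_)
    > np.sum(ctr, axis = 0)
    array([1., 1.])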

    Thus, in summary, I recommend computing the contributions directly from the factor scores, with no detour through explained_variance_:

    > ctr = X**2 / np.sum(X**2, axis = 0)
    

    For the squared cosine it's the same, except that we sum over the rows of pca.fit_transform(df):

    > cos_sq = X**2 / np.sum(X**2, axis = 1)[:, np.newaxis]
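
    Each row of cos_sq then sums to 1 by construction, since the denominator runs over the retained components only:

    > np.sum(cos_sq, axis = 1)
    array([1., 1., 1., 1., 1.])

    If instead you want the paper's denominator, the full squared distance $d_{i,g}^{2}$ to the center of gravity, keep all components (n_components=None) or divide by np.sum((df - df.mean()).to_numpy()**2, axis=1)[:, np.newaxis]; with n_components=2 the two coincide only up to the variance carried by the dropped component.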