
Usage of explained_variance_ for sklearn.decomposition.PCA


I am working on some principal component analysis and I happened to run into a couple of different ways of getting eigenvectors and eigenvalues.

Specifically, I found that in scipy.sparse.linalg there's an eigs method, and in sklearn.decomposition.PCA(), I can also get the eigenvalues by accessing the explained_variance_ attribute.

However, I've run it a couple of times and I am getting some mismatches in eigenvalues. I understand that it's possible for eigenvectors to be different because they may be scalar multiples, but I don't understand how eigenvalues could also differ.

Here's an example:

import numpy as np
import scipy.sparse.linalg as ll
from sklearn.decomposition import PCA

a = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]])
w1, v1 = ll.eigs(a, k=3)
w2 = PCA(n_components=3).fit(a).explained_variance_

w1.real
# array([ 1., -1.,  0.])

w2
# array([0.5       , 0.16666667, 0.        ])

You'll see that w1 and w2 have different eigenvalues. I'm not sure if I'm misunderstanding some basic linear algebra concepts here, or if something's wrong with my code.


Solution

  • scikit-learn's PCA.fit() expects a data matrix X of shape (n_samples, n_features); internally it centers X and decomposes the (n_features, n_features) covariance matrix of X. scipy's eigs(), by contrast, decomposes exactly the matrix you pass it. In your snippet, a plays two different roles: eigs() treats it as the matrix to decompose, while PCA treats it as a dataset of three samples and computes the eigenvalues of its covariance matrix, hence the different results.

    This means that, to obtain similar eigenvalues, you should fit scikit-learn's PCA to a dataset X whose covariance matrix is close to a. See the example below:

    import numpy as np
    import scipy.sparse.linalg as ll
    from sklearn.decomposition import PCA
    from sklearn.datasets import make_spd_matrix
    
    # number of dimensions
    n_dim = 3
    
    # covariance matrix
    a = make_spd_matrix(n_dim=n_dim, random_state=42)
    
    # dataset with given covariance matrix
    np.random.seed(42)
    X = np.random.multivariate_normal(mean=np.zeros(n_dim), cov=a, size=100000)
    
    # decompositions
    w0 = np.linalg.eig(a)[0]
    w1 = ll.eigs(a, k=n_dim, return_eigenvectors=False)
    w2 = PCA(n_components=n_dim).fit(X).explained_variance_
    
    # eigenvalues
    print([format(w, '.3f') for w in np.sort(w0.real)[::-1]])
    print([format(w, '.3f') for w in np.sort(w1.real)[::-1]])
    print([format(w, '.3f') for w in np.sort(w2.real)[::-1]])
    # ['3.616', '0.841', '0.242']
    # ['3.616', '0.841', '0.242']
    # ['3.616', '0.841', '0.242']
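    To connect this back to the original snippet: the w2 values in the question (0.5, 0.1667, 0) are exactly the eigenvalues of the sample covariance matrix of a when its rows are treated as three samples. scikit-learn's docs describe explained_variance_ as the squared singular values of the centered data divided by n_samples - 1, which is the same quantity. A minimal check, using only the matrix from the question:

    import numpy as np
    from sklearn.decomposition import PCA

    # the matrix from the question, now viewed as 3 samples x 3 features
    a = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]])

    # eigenvalues of the sample covariance matrix
    # (rowvar=False: columns are features, default ddof=1 divisor n-1)
    w_cov = np.sort(np.linalg.eigvalsh(np.cov(a, rowvar=False)))[::-1]

    # PCA's explained_variance_, and the equivalent SVD formulation
    pca = PCA(n_components=3).fit(a)
    s = np.linalg.svd(a - a.mean(axis=0), compute_uv=False)

    # all three agree with the w2 = [0.5, 0.1667, 0] from the question
    # (up to floating-point noise)
    assert np.allclose(w_cov, pca.explained_variance_)
    assert np.allclose(pca.explained_variance_, s**2 / (a.shape[0] - 1))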