Tags: python, numpy, scikit-learn, pca

Why does sklearn and numpy disagree about multiplying component of PCA?


import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

SEED = 123
X, y = make_blobs(n_samples=1000, n_features=5000, cluster_std=90., random_state=SEED)
pca = PCA(2)
pca.fit(X)
pca1, pca2 = pca.components_

pcaX = pca.transform(X)
pcaXnp = np.array([X @ pca1, X @ pca2]).T

If you print out pcaX and pcaXnp, you'll see that they're similar, but they don't agree with each other. Why do these differ? It seems like `.components_` should return exactly what sklearn is going to multiply the matrix by; is there a reason it's only an approximation of that multiplication?


Solution

  • PCA from sklearn.decomposition uses singular value decomposition (SVD) to obtain the principal components. The SVD yields the principal components only when the columns have first been centered on their means, and if you check the source code, sklearn does this centering before the SVD:

    def _fit_full(self, X, n_components):
        [...]
        # Center data
        self.mean_ = np.mean(X, axis=0)
        X -= self.mean_
    

    So to reproduce the PCA scores with plain NumPy, you need to center your matrix first:

    pcaX = pca.transform(X)
    Xc = X - X.mean(axis=0)
    pcaXnp = np.array([Xc @ pca1, Xc @ pca2]).T
    
    pcaX[:3]
    array([[-101.45177987,  212.45583745],
           [ 520.84541298,   87.32254399],
           [-273.26407231, -318.78493994]])
    
    pcaXnp[:3]
    array([[-101.45177987,  212.45583745],
           [ 520.84541298,   87.32254399],
           [-273.26407231, -318.78493994]])
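    Equivalently, you can reuse the column means that the fitted PCA object stored as `pca.mean_`, which is exactly what `transform` subtracts internally (a sketch, assuming the default `whiten=False`):

    ```python
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.decomposition import PCA

    SEED = 123
    X, y = make_blobs(n_samples=1000, n_features=5000, cluster_std=90., random_state=SEED)
    pca = PCA(2).fit(X)

    # transform subtracts the stored mean_ and projects onto components_
    manual = (X - pca.mean_) @ pca.components_.T
    assert np.allclose(pca.transform(X), manual)
    ```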