Tags: python, scikit-learn, pca

PCA matrix with sklearn


I did PCA on some data and I want to extract the PCA matrix. This is my code (excluding loading the data):

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
pca_result = pca.fit_transform(recon.data.cpu().numpy())
M = pca.components_

I thought that M should be the PCA matrix. However, when I print pca_result (first few rows), I get this:

[-21.08167   ,  -5.67821   ,   0.17554353,  -0.732398  ,0.04658243],
[-25.936056  ,  -6.535223  ,   0.6887493 ,  -0.8394666 ,0.06557591],
[-30.755266  ,  -6.0098953 ,   1.1643354 ,  -0.82322127,0.07585468]

But when I print np.transpose(np.matmul(M,np.transpose(recon))), I get this:

[-27.78438   ,  -2.5913327 ,   0.87771094,  -1.0819707 ,0.1037216 ],
[-32.63887   ,  -3.4483302 ,   1.3909296 ,  -1.1890743 ,0.12274324],
[-37.45802   ,  -2.9229708 ,   1.8665184 ,  -1.1728177 ,0.13301012]

What am I doing wrong and how do I get the actual PCA matrix? Thank you!


Solution

  • In PCA you go from an n-dimensional space to a different (rotated) n-dimensional space. This change of basis is done using an n×n matrix.

    This is indeed the matrix returned by pca.components_: its rows are the principal axes. Multiplying the PCA-transformed data by it, and adding back the mean that sklearn subtracted before projecting, reconstructs the original data X. That centering step is also what is missing in the question: recon was never centered, which is why multiplying the raw data by M does not reproduce pca_result.

    Here is a demonstration with the iris data:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.datasets import load_iris
    
    X = load_iris().data
    mu = np.mean(X, axis=0)  # feature means - PCA centers the data on these
    
    pca = PCA()
    X_pca = pca.fit_transform(X)
    M = pca.components_
    M
    # result:
    array([[ 0.36138659, -0.08452251,  0.85667061,  0.3582892 ],
           [ 0.65658877,  0.73016143, -0.17337266, -0.07548102],
           [-0.58202985,  0.59791083,  0.07623608,  0.54583143],
           [-0.31548719,  0.3197231 ,  0.47983899, -0.75365743]])
    

    i.e. a 4×4 matrix indeed (the iris data have 4 features).
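    Before reconstructing, it is worth checking the forward direction, because this is exactly where the computation in the question goes wrong: sklearn's PCA centers the data before projecting, so the transform is (X - mu) @ M.T, not a plain product of M with the raw data. A quick check, using only the objects defined above:

    # the PCA transform is centering followed by projection onto the rows of M;
    # the question skips the centering, hence the mismatch with pca_result
    print(np.allclose(X_pca, (X - mu) @ M.T))  # True
    print(np.allclose(mu, pca.mean_))          # True - sklearn stores the subtracted mean
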

    Let's reconstruct the original data using all PCs:

    X_hat = np.matmul(X_pca, M)
    X_hat = X_hat + mu # add back the mean
    print(X_hat[0]) # reconstructed
    print(X[0])     # original
    

    Result:

    [5.1 3.5 1.4 0.2]
    [5.1 3.5 1.4 0.2]
    

    i.e. perfect reconstruction.
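
    The check passes for the whole dataset, not just the first row:

    np.allclose(X, X_hat)
    # result:
    True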

    Reconstructing with fewer PCs, let's say 2 (out of 4):

    n_comp = 2
    X_hat2 = np.matmul(X_pca[:, :n_comp], M[:n_comp, :])
    X_hat2 = X_hat2 + mu
    print(X_hat2[0])
    

    Result:

    [5.08303897 3.51741393 1.40321372 0.21353169]
    

    i.e. a less accurate reconstruction, as expected, since we truncated to 2 PCs instead of using all 4.
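
    To make the trade-off concrete, here is a short sketch (reusing X, X_pca, M and mu from above) that prints the mean squared reconstruction error for each number of retained PCs; the error shrinks as PCs are added and vanishes when all 4 are kept:

    for k in range(1, 5):
        X_hat_k = np.matmul(X_pca[:, :k], M[:k, :]) + mu
        mse = np.mean((X - X_hat_k) ** 2)
        print(f"{k} PC(s): reconstruction MSE = {mse:.6f}")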

    (Code adapted from the great thread "How to reverse PCA and reconstruct original variables from several principal components?" at Cross Validated.)