
Performing SVD with sklearn.decomposition.PCA: how can I get U, S, and V?


I perform SVD with sklearn.decomposition.PCA

From the equation of the SVD

A = U S V^T

where V^T is the transpose of V.

If I want the matrices U, S, and V, how can I get them when using sklearn.decomposition.PCA?


Solution

  • First of all, depending on the size of your matrix, the sklearn implementation of PCA will not always compute the full SVD. The following is taken from PCA's GitHub repository:

    svd_solver : string {'auto', 'full', 'arpack', 'randomized'}
            auto :
                the solver is selected by a default policy based on `X.shape` and
                `n_components`: if the input data is larger than 500x500 and the
                number of components to extract is lower than 80% of the smallest
                dimension of the data, then the more efficient 'randomized'
                method is enabled. Otherwise the exact full SVD is computed and
                optionally truncated afterwards.
            full :
                run exact full SVD calling the standard LAPACK solver via
                `scipy.linalg.svd` and select the components by postprocessing
            arpack :
                run SVD truncated to n_components calling ARPACK solver via
                `scipy.sparse.linalg.svds`. It requires strictly
                0 < n_components < X.shape[1]
            randomized :
                run randomized SVD by the method of Halko et al.
    

    In addition, it also manipulates the data before decomposing it, for example by centering each feature (see here).
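
A minimal sketch of both points, assuming a scikit-learn version that exposes the `singular_values_` attribute: it forces the exact solver with `svd_solver="full"` and checks that the singular values PCA reports are those of the centered data, not of `X` itself. The variable names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]], dtype=float)

# Force the exact LAPACK-based solver instead of relying on 'auto'.
pca = PCA(n_components=2, svd_solver="full").fit(X)

# PCA subtracts the per-feature mean before decomposing, so compare
# against the SVD of the centered matrix rather than of X itself.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

mean_matches = np.allclose(pca.mean_, X.mean(axis=0))
singular_values_match = np.allclose(pca.singular_values_, singular_values)
print(mean_matches, singular_values_match)
```

Setting the solver explicitly removes any doubt about whether the randomized method was selected for a given input size.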

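    Now, if you want the U, S, and V that sklearn.decomposition.PCA uses internally, you can call pca._fit(X). Note that _fit is a private method, so its behavior and return value may change between scikit-learn versions. For example: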

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]])
    pca = PCA(n_components=2)
    # _fit is a private method; here it returns the tuple (U, S, V)
    print(pca._fit(X))
    

    prints

    (array([[ -3.55731195e-01,   5.05615563e-01],
            [  2.88830295e-04,  -3.68261259e-01],
            [  7.10884729e-01,  -2.74708608e-01],
            [ -5.68187889e-01,  -4.43103380e-01],
            [  2.12745524e-01,   5.80457684e-01]]),
     array([ 9.950385  ,  0.76800941]),
     array([[ 0.69988535,  0.71425521],
            [ 0.71425521, -0.69988535]]))
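
If relying on a private method is a concern, the same factors can probably be rebuilt from public attributes. This is a sketch under assumptions, not the library's documented recipe: it assumes `whiten=False` (the default), non-zero singular values, and a scikit-learn version that provides `singular_values_`. In that case `transform(X)` equals U scaled by S, so dividing by S gives U back.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]], dtype=float)
pca = PCA(n_components=2, svd_solver="full").fit(X)

S = pca.singular_values_   # shape (n_components,)
Vt = pca.components_       # shape (n_components, n_features)
# With whiten=False, transform(X) equals U * S, so divide S back out.
U = pca.transform(X) / S   # shape (n_samples, n_components)

# The factors decompose the centered data, not X itself.
X_centered = X - pca.mean_
reconstruction_ok = np.allclose(U @ np.diag(S) @ Vt, X_centered)
columns_orthonormal = np.allclose(U.T @ U, np.eye(len(S)))
print(reconstruction_ok, columns_orthonormal)
```

Note that the matrix stored in `components_` is already the transposed factor: its rows are the principal axes.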
    

    However, if you just want the SVD of the original (uncentered) data, I would suggest using scipy.linalg.svd.
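
A short sketch of that route, reusing the example matrix from above under the name `A`. Passing `full_matrices=False` requests the reduced (economy-size) factors, which is an illustrative choice here rather than a requirement.

```python
import numpy as np
from scipy.linalg import svd

A = np.array([[1, 2], [3, 5], [8, 10], [-1, 1], [5, 6]], dtype=float)

# svd returns U, the singular values as a 1-D array, and V transposed.
U, s, Vt = svd(A, full_matrices=False)

# Rebuild A from its factors to confirm A = U S V^T.
reconstruction_ok = np.allclose(U @ np.diag(s) @ Vt, A)
print(U.shape, s.shape, Vt.shape)  # (5, 2) (2,) (2, 2)
print(reconstruction_ok)
```

Because no centering is applied here, these factors will generally differ from the ones PCA computes for the same matrix.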