python scikit-learn pca

PCA using sklearn


I have a large input matrix of shape (20, 20000) and am trying to perform PCA on it with the sklearn Python package. Here, 20 is the number of subjects (samples) and 20,000 is the number of features. Below is sample code:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X = rng.randn(20, 20000)
print(X.shape)

>> (20, 20000)

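# n_components cannot exceed min(n_samples, n_features) = 20 here;
# recent scikit-learn versions reject n_components=21 with a ValueError
pca = PCA(n_components=20)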
pca.fit(X)
X_pca = pca.transform(X)
print("Original shape: ", X.shape)
print("Transformed shape: ", X_pca.shape)

>> Original shape: (20, 20000)
>> Transformed shape: (20, 20)

With PCA, am I not able to get back more components than I have samples (rows of X)? Why is the number of PCA components limited by the number of samples?


Solution

  • This has more to do with PCA itself than with sklearn, but the limit is:

    if n_samples <= n_features:
        maxn_pc = n_samples - 1
    else:
        maxn_pc = n_features
    

    That is, if the number of samples (n) is less than or equal to the number of features (f), the greatest number of non-trivial components you can extract is n-1, because PCA centers the data first and the n centered samples span at most n-1 dimensions. Otherwise, the greatest number of non-trivial components is f.
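
To see where the n-1 limit comes from, here is a minimal sketch using only NumPy and the same random data as in the question. It is not sklearn's implementation; the names `Xc`, `s`, and `variances` are my own. It centers the data, takes the singular values, and checks how many directions carry any variance. The squared singular values divided by (n-1) correspond to what sklearn reports as `explained_variance_`.

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(20, 20000)  # 20 samples, 20000 features

# PCA works on mean-centered data. After centering, the 20 rows add up
# to the zero vector, so they span at most 19 dimensions.
Xc = X - X.mean(axis=0)

# Singular values of the centered data, in descending order. Squared and
# divided by (n - 1), they give the variance along each principal axis.
s = np.linalg.svd(Xc, compute_uv=False)
variances = s ** 2 / (X.shape[0] - 1)

rank = np.linalg.matrix_rank(Xc)
print("rank of centered X:", rank)
print("smallest / largest variance:", variances[-1] / variances[0])
```

The 20th direction exists numerically, but its variance is zero up to floating-point error, which is why only n-1 components count as non-trivial.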