I am trying to use scikit-learn's PCA with the following code to reduce my 5000-D data to 32-D:
from sklearn.decomposition import PCA
import numpy as np
arr = np.random.randint(1,10,(10,5000)).astype(float)
pca = PCA(n_components=32)
newData = pca.fit_transform(arr)
print(newData.shape)
With the code above I got newData of shape (10,10) (10 samples, each 10-dimensional). My understanding of PCA is that it should result in newData of shape (10,32), but that's not the case here. After changing the input data (arr) to have 50 samples, I got newData of shape (50,32), which is what I expected. It seems that sklearn automatically sets n_components to min(num_samples,num_dimension) if that value is smaller than the given n_components (32 in this case).
Could anyone tell me what the purpose of this is?
There are simply not enough data to calculate all the components you asked for.
Or, said differently: they would be arbitrary, and their associated variance would be 0, because with only 10 samples the feature covariance matrix has rank at most 10 (you would need rank 32 to be able to get 32 components).
So scikit-learn just doesn't return them.
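If you want to check the rank argument yourself, here is a minimal sketch using only NumPy. The seed and the variable names (rng, centered, singular_values, explained_variance, rank) are my own choices for illustration, and it is not how scikit-learn is implemented internally. It centres the data the way PCA does, then looks at the singular values, whose squares (divided by n_samples - 1) are the variances along each component.

```python
import numpy as np

# Same kind of data as in the question: 10 samples, 5000 features.
# The seed is an arbitrary choice so the run is repeatable.
rng = np.random.default_rng(0)
arr = rng.integers(1, 10, size=(10, 5000)).astype(float)

# PCA works on mean-centred data, so subtract the per-feature mean.
centered = arr - arr.mean(axis=0)

# A 10 x 5000 matrix has only min(10, 5000) = 10 singular values.
singular_values = np.linalg.svd(centered, compute_uv=False)

# Variance explained along each principal direction.
explained_variance = singular_values ** 2 / (arr.shape[0] - 1)

# Numerical rank of the centred data (equal to the rank of its covariance matrix).
rank = np.linalg.matrix_rank(centered)

print(len(singular_values))  # number of directions available at all
print(rank)                  # number of directions with non-zero variance
print(explained_variance)
```

There are only 10 singular values to begin with, so there is nothing to fill components 11 to 32 with. Centring also uses up one degree of freedom, so the rank of the centred data can be at most 9 and the last variance should come out as zero up to rounding error.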