I'm using sklearn.decomposition.PCA to pre-process some training data for a machine learning model. There are 247 data points with 4095 dimensions, imported from a CSV file using pandas. I then scale the data,

training_data = StandardScaler().fit_transform(training[:,1:4096])

before calling the PCA algorithm to obtain the variance for each dimension:

pca = PCA(n_components)
pca.fit(training_data)
The output is a vector of length 247, but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point.
My code looks like:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

test = np.array(pd.read_csv("testing.csv", sep=','))
training = np.array(pd.read_csv("training.csv", sep=','))
# ID Number = [0]
# features = [1:4096]
training_data = StandardScaler().fit_transform(training[:,1:4096])
test_data = StandardScaler().fit_transform(test[:,1:4096])
training_labels = training[:,4096]
pca = PCA()
pca.fit(training_data)
pca_variance = pca.explained_variance_
I have tried taking the transpose of training_data, but this didn't change the output. I have also tried changing n_components in the argument to PCA, but it insists that there can only be 247 dimensions.

This may be a stupid question, but I'm very new to this sort of data processing. Thank you.
You said:

"but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point."

No. This would only be true if you estimated 4095 components using pca = PCA(n_components=4095).
On the other hand, you define:

pca = PCA() # this is actually PCA(n_components=None)

so n_components is set to None.
When this happens, we have (see the scikit-learn documentation):

n_components == min(n_samples, n_features)

Thus, in your case, you have min(247, 4095) = 247 components.
So, pca.explained_variance_ will be a vector of length 247, since you have 247 PC dimensions.
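You can verify this behaviour with a small sketch (toy shapes standing in for your 247 x 4095 matrix, with random data assumed purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the 247 x 4095 case: fewer samples than features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))  # 20 samples, 50 features

pca = PCA()  # n_components=None
pca.fit(X)

print(pca.n_components_)              # resolves to min(20, 50) = 20
print(pca.explained_variance_.shape)  # (20,), not (50,)
```

Requesting more components than min(n_samples, n_features) raises a ValueError, which is the "it insists there can only be 247 dimensions" behaviour you observed.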
Why do we have n_components == min(n_samples, n_features)?
This is related to the rank of the covariance/correlation matrix. For a data matrix X with shape [247, 4095], the covariance/correlation matrix has shape [4095, 4095], but its rank is at most min(n_samples, n_features). Thus, you have at most min(n_samples, n_features) meaningful PC components/dimensions.
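A quick numerical check of the rank argument, again with toy shapes (note the small wrinkle that np.cov subtracts the column means, which costs one additional rank):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))  # n_samples=20, n_features=50

cov = np.cov(X, rowvar=False)  # covariance matrix over features
print(cov.shape)               # (50, 50)

# The rank is bounded by the number of samples, not by the matrix size.
print(np.linalg.matrix_rank(cov))  # 19 = n_samples - 1, due to mean centering
```

So even though the covariance matrix is 50 x 50, it only has 19 non-degenerate directions, and PCA can only extract that many meaningful components.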