Tags: python, scikit-learn, pca, decomposition

PCA().fit() is using the wrong axis for data input


I'm using sklearn.decomposition.PCA to pre-process some training data for a machine learning model. There are 247 data points with 4095 dimensions, imported from a CSV file using pandas. I then scale the data

training_data = StandardScaler().fit_transform(training[:,1:4096])

before calling the PCA algorithm to obtain the variance for each dimension,

pca = PCA(n_components)

pca.fit(training_data)

The output is a vector of length 247, but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point.

My code looks like:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

test = np.array(pd.read_csv("testing.csv", sep=','))
training = np.array(pd.read_csv("training.csv", sep=','))
# ID Number = [0]
# features = [1:4096]

# Standardize the 4095 feature columns (column 0 is the ID)
training_data = StandardScaler().fit_transform(training[:, 1:4096])
test_data = StandardScaler().fit_transform(test[:, 1:4096])
training_labels = training[:, 4096]

pca = PCA()
pca.fit(training_data)
pca_variance = pca.explained_variance_

I have tried taking the transpose of training_data, but this didn't change the output. I have also tried changing n_components in the argument of PCA(), but it insists that there can only be 247 dimensions.

This may be a stupid question, but I'm very new to this sort of data processing. Thank you.


Solution

  • You said:

    " but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point."

    No. That would only be true if you estimated 4095 components, i.e. pca = PCA(n_components=4095), and that is not even possible here, as the sketch below shows.
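
    As a quick check, here is a minimal sketch (random data standing in for your CSV, which I don't have) showing that scikit-learn rejects a request for more components than min(n_samples, n_features):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)          # hypothetical stand-in for the real data
    X = rng.standard_normal((247, 4095))    # same shape as yours: 247 samples, 4095 features

    try:
        PCA(n_components=4095).fit(X)
    except ValueError as err:
        print(err)  # n_components cannot exceed min(n_samples, n_features) = 247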


    On the other hand, you define:

    pca = PCA() # this is actually PCA(n_components=None)
    

    so n_components is set to None.


    When this happens, we have (see the scikit-learn PCA documentation):

    n_components == min(n_samples, n_features)

    Thus, in your case, you have min(247, 4095) = 247 components.

    So, pca.explained_variance_ will be a vector of length 247, since you have 247 PC dimensions.
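
    You can reproduce this with a minimal sketch (again, random data of the same shape standing in for training_data):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)          # hypothetical stand-in for the real data
    X = rng.standard_normal((247, 4095))    # 247 samples, 4095 features

    pca = PCA()                             # n_components=None
    pca.fit(X)
    print(pca.explained_variance_.shape)    # (247,) == min(n_samples, n_features)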


    Why do we have n_components == min(n_samples, n_features)?

    This is related to the rank of the covariance/correlation matrix. For a data matrix X of shape [247, 4095], the covariance/correlation matrix would be [4095, 4095], with maximum rank min(n_samples, n_features). Thus, you have at most min(n_samples, n_features) meaningful PC components/dimensions.
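
    To make the rank argument concrete, a small sketch (random stand-in data again): the centered data matrix has the same rank as the covariance matrix, and that rank can never exceed min(n_samples, n_features):

    import numpy as np

    rng = np.random.default_rng(0)          # hypothetical stand-in for the real data
    X = rng.standard_normal((247, 4095))    # 247 samples, 4095 features

    Xc = X - X.mean(axis=0)                 # center the features, as PCA does internally
    print(np.linalg.matrix_rank(Xc))        # at most min(247, 4095) = 247
                                            # (centering typically makes it 246 in practice)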