Apply PCA to the test data


I am trying to perform PCA in Python using sklearn. I have created the following function:

from sklearn.decomposition import PCA

def dimensionality_reduction(train_dataset_mod1, train_dataset_mod2, test_dataset_mod1, test_dataset_mod2):
    pca = PCA(n_components=200)
    pca.fit(train_dataset_mod1.transpose())
    mod1_features_train = pca.components_
    pca2 = PCA(n_components=200)
    pca2.fit(train_dataset_mod2.transpose())
    mod2_features_train = pca2.components_
    mod1_features_test = pca.transform(test_dataset_mod1)
    mod2_features_test = pca2.transform(test_dataset_mod2)

    return mod1_features_train.transpose(), mod2_features_train.transpose(), mod1_features_test, mod2_features_test

The sizes of my matrices are the following:

train_dataset_mod1 733x5000
test_dataset_mod1 360x5000
mod1_features_train 200x733
train_dataset_mod2 733x8000
test_dataset_mod2 360x8000
mod2_features_train 200x733

However, when I try to run the whole script, I receive the following message:

File "\Anaconda2\lib\site-packages\sklearn\decomposition\base.py", line 132, in transform
    X = X - self.mean_

What is the issue? How can I apply the PCA to the test data?

Here is an example of debugging the PCA for mod1:

[image: debugger view of the PCA object for mod1]

The transformed datasets mod1_features_train and mod2_features_train both have the correct size, 500x733. However, I cannot do the same with test_dataset_mod1 and test_dataset_mod2. Why?

EDIT: During debugging I noticed that in the base.py file of PCA there is an operation X = X - self.mean_, where X is my test data and self.mean_ is the mean calculated from the fit on the train set (the size of self.mean_ is 733, which does not match X). If I remove the transpose() in the training process, the PCA works normally without errors, and test_dataset_mod1 and test_dataset_mod2 have the correct size, 360x500; however, train_dataset_mod1 and train_dataset_mod2 have the wrong size, 5000x500. Why?
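
The shape mismatch described in the EDIT can be reproduced on small random data. This is only a minimal sketch: the array sizes (30x50 and 12x50 standing in for 733x5000 and 360x5000), the variable names, and n_components=5 are made up for illustration and are not taken from the original script.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(30, 50))  # hypothetical stand-in for the 733x5000 train matrix
test = rng.normal(size=(12, 50))   # hypothetical stand-in for the 360x5000 test matrix

# Fitting on the transpose makes PCA treat the 30 training rows as features.
pca_t = PCA(n_components=5).fit(train.T)
mean_shape = pca_t.mean_.shape              # one mean per "feature": 30 entries
components_shape = pca_t.components_.shape  # n_components x 30

# The test matrix has 50 columns, so it cannot be centred with a 30-entry mean.
try:
    pca_t.transform(test)
    transform_failed = False
except ValueError:
    transform_failed = True

print(mean_shape, components_shape, transform_failed)
```

In this sketch the fitted mean has one entry per training row rather than per original column, which is why subtracting it from the test matrix fails.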


Solution

  • You shouldn't transpose your matrix in the fit function; or, if you have to, you also have to transpose the matrix in the transform function:

    pca.fit(train_dataset_mod1)
    pca2.fit(train_dataset_mod2)
    mod1_features_test = pca.transform(test_dataset_mod1)
    mod2_features_test = pca2.transform(test_dataset_mod2)
    

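    or (note that this only runs when the train and test sets have the same number of rows, which is not the case for the 733-row and 360-row matrices in the question):

    pca.fit(train_dataset_mod1.transpose())
    pca2.fit(train_dataset_mod2.transpose())
    mod1_features_test = pca.transform(test_dataset_mod1.transpose())
    mod2_features_test = pca2.transform(test_dataset_mod2.transpose())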
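
For reference, here is a minimal sketch of the untransposed variant on small random data. It assumes rows are samples and columns are features; the array sizes, variable names, and n_components=5 are illustrative only. Note that pca.components_ holds the principal axes (n_components x n_features), not the projected training samples, so in this sketch the reduced training set is taken from fit_transform rather than from components_.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(30, 50))  # hypothetical: 30 training samples, 50 features
test = rng.normal(size=(12, 50))   # hypothetical: 12 test samples, same 50 features

pca = PCA(n_components=5)

# Fit on the training data (samples x features) and project it in one step.
train_features = pca.fit_transform(train)

# Reuse the mean and axes learned from the training data to project the test data.
test_features = pca.transform(test)

print(train_features.shape)   # samples x n_components for the training set
print(test_features.shape)    # samples x n_components for the test set
print(pca.components_.shape)  # n_components x n_features (the axes themselves)
```

With the question's matrices and n_components=200, the same pattern would be expected to give 733x200 training features and 360x200 test features.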