I am trying to apply PCA in Python using sklearn. I have written the following function:
from sklearn.decomposition import PCA

def dimensionality_reduction(train_dataset_mod1, train_dataset_mod2, test_dataset_mod1, test_dataset_mod2):
    pca = PCA(n_components=200)
    pca.fit(train_dataset_mod1.transpose())
    mod1_features_train = pca.components_
    pca2 = PCA(n_components=200)
    pca2.fit(train_dataset_mod2.transpose())
    mod2_features_train = pca2.components_
    mod1_features_test = pca.transform(test_dataset_mod1)
    mod2_features_test = pca2.transform(test_dataset_mod2)
    return mod1_features_train.transpose(), mod2_features_train.transpose(), mod1_features_test, mod2_features_test
The sizes of my matrices are as follows:
train_dataset_mod1 733x5000
test_dataset_mod1 360x5000
mod1_features_train 200x733
train_dataset_mod2 733x8000
test_dataset_mod2 360x8000
mod2_features_train 200x733
However, when I run the whole script I get the following error:
File "\Anaconda2\lib\site-packages\sklearn\decomposition\base.py", line 132, in transform
    X = X - self.mean_
What is the issue? How can I apply the PCA to the test data?
Here is an example from debugging the PCA for mod1:
The transformed datasets mod1_features_train and mod2_features_train both have the correct size, 500x733. However, I cannot do the same with test_dataset_mod1 and test_dataset_mod2. Why?
EDIT: While debugging, I noticed that the PCA code in base.py performs the operation X = X - self.mean_, where X is my test data and self.mean_ is the mean computed when fitting on the training set (self.mean_ has size 733, which does not match X). If I remove the transpose() in the training step, PCA runs without errors and the transformed test_dataset_mod1 and test_dataset_mod2 have the correct size, 360x500. However, the results for train_dataset_mod1 and train_dataset_mod2 then have the wrong size, 5000x500. Why?
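The shape mismatch described in the EDIT can be reproduced with NumPy alone. This is a minimal sketch that mimics only the centering step from the traceback; it is not sklearn's actual code, and the array sizes are small synthetic stand-ins (73x500 and 36x500) rather than the real 733x5000 and 360x5000 matrices.

```python
import numpy as np

# Synthetic stand-ins for the real data: rows are samples, columns are features.
rng = np.random.default_rng(0)
train = rng.normal(size=(73, 500))
test = rng.normal(size=(36, 500))

# Fitting on the transposed matrix yields one mean per *training sample*.
mean_from_transposed = train.T.mean(axis=0)   # shape (73,)

# Fitting on the matrix as-is yields one mean per *feature*.
mean_from_rows = train.mean(axis=0)           # shape (500,)

# Centering the test data with the per-sample mean cannot broadcast.
try:
    test - mean_from_transposed
    mismatch = False
except ValueError:
    mismatch = True

# Centering with the per-feature mean works, because the column counts agree.
centered = test - mean_from_rows
print(mismatch, centered.shape)
```

The mean learned during fit has one entry per column of whatever matrix was passed in, so the matrix passed to transform needs the same number of columns.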
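You shouldn't transpose your matrix in the fit function; or, if you have to, you must also transpose the matrix in the transform function. sklearn's PCA expects rows to be samples and columns to be features, and transform needs the same number of features that fit saw: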
pca.fit(train_dataset_mod1)
pca2.fit(train_dataset_mod2)
mod1_features_test = pca.transform(test_dataset_mod1)
mod2_features_test = pca2.transform(test_dataset_mod2)
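or (this variant only works if the train and test matrices have the same number of rows; with the 733-row and 360-row matrices from the question it would still raise a shape error):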
pca.fit(train_dataset_mod1.transpose())
pca2.fit(train_dataset_mod2.transpose())
mod1_features_test = pca.transform(test_dataset_mod1.transpose())
mod2_features_test = pca2.transform(test_dataset_mod2.transpose())
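For reference, here is a minimal sketch of the first variant applied to one modality. It assumes rows are samples and columns are features, and it uses small random arrays (73x500 and 36x500) in place of the real data; the helper name reduce_modality is hypothetical. Note that pca.components_ holds the principal axes, with shape (n_components, n_features), not the projected training data, which would explain the unexpected 5000x500 size. The reduced training features come from fit_transform (or transform) instead.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_modality(train, test, n_components):
    """Fit PCA on the training rows only, then project both sets.

    Hypothetical helper for illustration; rows are samples, columns are features.
    """
    pca = PCA(n_components=n_components)
    train_features = pca.fit_transform(train)  # (n_train, n_components)
    test_features = pca.transform(test)        # (n_test, n_components)
    return pca, train_features, test_features

# Synthetic stand-ins for train_dataset_mod1 and test_dataset_mod1.
rng = np.random.default_rng(0)
train_mod1 = rng.normal(size=(73, 500))
test_mod1 = rng.normal(size=(36, 500))

pca, train_feat, test_feat = reduce_modality(train_mod1, test_mod1, n_components=20)

print(train_feat.shape)       # reduced training data
print(test_feat.shape)        # reduced test data
print(pca.components_.shape)  # principal axes, not reduced data
```

With the sizes from the question and n_components=200, the same pattern should give 733x200 training features and 360x200 test features for each modality, with one PCA object fitted per modality.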