Tags: python, numpy, scikit-learn, pca, dimensionality-reduction

How to decide whether to use train data or test data when using PCA?


I am new to PCA and have a question about visualization when it comes to fitting and transforming. I have two datasets, train and test. Here are four possible methods:

# Method 1)
pca = myPCA(n_components = 5) # Conduct myPCA with 5 principal components.
pca.fit(X_train) # Calculate 5 principal components on the training dataset
X_train_pca = pca.transform(X_train)

# Method 2)
pca = myPCA(n_components = 5) # Conduct myPCA with 5 principal components.
pca.fit(X_test)
X_test_pca = pca.transform(X_test)

# Method 3)
pca = myPCA(n_components = 5) # Conduct myPCA with 5 principal components.
pca.fit(X_train)
X_test_pca = pca.transform(X_train)

# Method 4)
pca = myPCA(n_components = 5) # Conduct myPCA with 5 principal components.
pca.fit(X_train)
X_test_pca = pca.transform(X_test)

Which of the four methods above is the right way to use PCA for visualization? The PCA tutorial says it should be run on the test data, but I cannot seem to work out which method that corresponds to.

Here is my code:

import numpy as np

class myPCA():
    """
    Principal Component Analysis (a linear dimensionality reduction method).
    """

    def __init__(self, n_components = 2):
        """
        Keep the top n_components principal components (default 2: the principal and orthogonal modes of variation).
        """
        self.n_c = n_components

    def fit(self, X):
        """
        Compute the covariance matrix and keep its leading eigenvectors.
        """
        cov_mat = np.cov(X.T) # Covariance matrix
        eig_val, eig_vec = np.linalg.eigh(cov_mat) # Eigenvalues and orthogonal eigenvectors in ascending order
        eig_val = np.flip(eig_val) # Reverse the order, now descending
        eig_vec = np.flip(eig_vec, axis=1) # Reverse the column order to match
        self.eig_values = eig_val[:self.n_c] # Select the top eigenvalues
        self.principle_components = eig_vec[:, :self.n_c] # Select the top eigenvectors
        self.variance_ratio = self.eig_values / eig_val.sum() # Variance explained by each PC

    def transform(self, X):
        """
        Compute the score matrix: project the (centered) data onto the principal components.
        """
        return np.matmul(X - X.mean(axis=0), self.principle_components)
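
As a quick sanity check that the class behaves as intended, its output can be compared against scikit-learn's PCA on the training data. This is only a sketch and assumes X_train is a NumPy array; individual components may come out with flipped signs, so absolute values of the scores are compared:

from sklearn.decomposition import PCA

# Fit both implementations on the same training data
pca_mine = myPCA(n_components = 5)
pca_mine.fit(X_train)
scores_mine = pca_mine.transform(X_train)

pca_sk = PCA(n_components = 5)
scores_sk = pca_sk.fit_transform(X_train)

# The explained-variance ratios should match closely
print(pca_mine.variance_ratio)
print(pca_sk.explained_variance_ratio_)

# Scores should agree up to a per-component sign flip (assuming distinct eigenvalues)
print(np.allclose(np.abs(scores_mine), np.abs(scores_sk)))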

Visualization Code (still not sure whether to use X_train or X_test below):

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
figure = plt.figure(dpi=100)
plt.scatter(X_test_pca[:, 0], X_test_pca[:, 1], c=y_test, s=15,
            edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('tab10', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar();

Solution

  • If the task is just to visualize the data in two dimensions to see how the observations are distributed, it does not matter much which dataset you fit or transform on. You can even fit over the entire dataset and then transform that same dataset.

    But I am guessing you want to use this as part of some model-development pipeline and want to see whether the transformation generalizes well across the two datasets. If that is the case, you should always fit your transformations on the training data and use that fitted transformation to transform both the training and the test data (Method 4 above).

    This helps the transformations, and subsequently the models, generalize to new datasets. A minimal sketch of that workflow is shown below.
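
A minimal sketch of that recommended pattern (Method 4), reusing the myPCA class and the X_train / X_test arrays from the question:

# Fit the transformation on the training data only ...
pca = myPCA(n_components = 5)
pca.fit(X_train)

# ... then apply the same fitted components to both datasets
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# For a pure visualization of every observation, fitting and transforming
# the combined dataset is also acceptable:
# pca_all = myPCA(n_components = 5)
# X_all = np.vstack([X_train, X_test])
# pca_all.fit(X_all)
# X_all_pca = pca_all.transform(X_all)

Either way, the scatter plot from the visualization code above can then be drawn from X_test_pca (or X_all_pca).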