Tags: python, machine-learning, scikit-learn, computer-vision, pca

Applying PCA to one sample


I am currently working on an image recognition project with machine learning.

  • The training set has 1600 images of size 300x300, so 90000 features per image.
  • To speed up training, I apply PCA with n_components = 50
  • The test set has 450 images, and I can test the model on this test set successfully.

Now I want to predict on a single image captured by a webcam. The question is: should I apply PCA to that image?

  • If I don't apply PCA, I get ValueError: X.shape[1] = 90000 should be equal to 50, the number of features at training time
  • If I apply PCA, I get ValueError: n_components=50 must be between 0 and min(n_samples, n_features)=1 with svd_solver='full'

I use Python 3 and scikit-learn 0.20.3; this is how I apply PCA:

from sklearn.decomposition import PCA
pca = PCA(50)
pca.fit_transform(features)  # features: the (1600, 90000) training matrix

Solution

  • You need to apply PCA to your test set (and to any new image you want to predict) as well.

    You need to consider what PCA does:

    PCA constructs a new feature set (containing fewer features than the original feature space), and you then train your model on this new feature set. You need to construct the same new feature set for the test set for your model to be valid!

    It's important to note that each feature in your 'reduced' feature set is a linear combination of the original features; for a given number of new features (n_components), these are the linear combinations that preserve as much of the variance of the original space as possible in the new space.

    Practically, to perform the relevant transformation on your test set, you need to do:

    # X_test - your untransformed test set
    
    X_test_reduced = pca.transform(X_test)
    

    where pca is the instance of PCA() fitted on your training set. Essentially, you are constructing a transformation to a lower-dimensional space, and you want this transformation to be the same for the training and test sets! If you fit pca independently on the training and test sets, you are (almost certainly) embedding the data into different low-dimensional representations and will end up with different feature sets.
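
    As a concrete sketch of the full workflow (the variable names and the small random stand-in arrays below are hypothetical; in your project they would be the flattened 300x300 images): fit PCA once on the training data, then reuse the same fitted instance to transform the test set and any single new sample. Note that a single webcam image must be reshaped to a 2-D array of shape (1, n_features) before calling transform; calling fit_transform on one image is what produces the n_components=50 error you saw, because PCA is then fitted on a single sample.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    # Small stand-in arrays; the real shapes would be
    # (1600, 90000), (450, 90000) and (90000,).
    train_features = rng.random((160, 9000))
    test_features = rng.random((45, 9000))
    webcam_image = rng.random(9000)

    pca = PCA(n_components=50)

    # Fit on the training set only, and keep the fitted instance around.
    train_reduced = pca.fit_transform(train_features)            # (160, 50)

    # Reuse the SAME fitted pca to transform the test set ...
    test_reduced = pca.transform(test_features)                  # (45, 50)

    # ... and any single new sample, reshaped to one row of n_features columns.
    webcam_reduced = pca.transform(webcam_image.reshape(1, -1))  # (1, 50)

    If the webcam prediction runs in a separate script or process, one option is to persist the fitted pca (and the trained model) with joblib and reload them there, rather than refitting anything on new data.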