
Dimensionality reduction using PCA - MATLAB


I am trying to reduce dimensionality of a training set using PCA. I have come across two approaches.

% V: principal component coefficients, U: scores, eigen: eigenvalues
[V, U, eigen] = pca(train_x);

% Find the smallest number of components explaining >= 90% of the variance
eigen_sum = 0;
for lamda = 1:length(eigen)
    eigen_sum = eigen_sum + eigen(lamda, 1);
    if (eigen_sum / sum(eigen) >= 0.90)
        break;
    end
end

% Project the training set onto the first lamda components
train_x = train_x * V(:, 1:lamda);

Here, I simply use the eigenvalue matrix to reconstruct the training set with a smaller number of features, determined by the principal components that describe 90% of the variance of the original set.
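For reference, the same 90%-variance selection can be sketched in Python with NumPy rather than MATLAB (an illustration of the idea, not the asker's code; the synthetic `train_x` and variable names are chosen to mirror the snippet above):

```python
import numpy as np

# Synthetic training data with a few dominant directions of variance
rng = np.random.default_rng(0)
train_x = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Center the data (MATLAB's pca() does this internally by default)
Xc = train_x - train_x.mean(axis=0)

# Eigendecomposition of the covariance matrix: columns of V are the
# principal directions, eigen holds the variances along them
eigen, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigen)[::-1]          # sort descending, as pca() does
eigen, V = eigen[order], V[:, order]

# Smallest number of components whose cumulative variance reaches 90%
lamda = np.searchsorted(np.cumsum(eigen) / eigen.sum(), 0.90) + 1

# Project onto the kept components
reduced = Xc @ V[:, :lamda]
print(lamda, reduced.shape)
```

The cumulative-sum-plus-threshold step does the same job as the `for` loop in the MATLAB version.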

The alternate method that I found is almost exactly the same, save the last line, which changes to:

train_x=U(:,1:lamda);

In other words, we take as the new training set the principal-component representation of the original training set, truncated at feature lamda.

Both of these methods seem to yield similar results (out-of-sample test error), but there is a difference, however minuscule it may be.

My question is, which one is the right method?


Solution

  • The answer depends on your data, and what you want to do.

    Using your variable names: generally speaking, it is easy to expect that the outputs of pca satisfy

    U = train_x * V
    

    But this is only true if your data is normalized, specifically if you have already removed the mean from each component. If not, then what one can expect is

    U = train_x * V - mean(train_x * V)
    

    And in that regard, whether you want to remove or keep the mean of your data before processing it depends on your application.
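    The relationship above can be checked numerically. Here is a NumPy sketch (again an illustration in Python rather than MATLAB, with assumed synthetic data): for centered data the scores equal the projection `train_x * V`, while for uncentered data they differ by exactly the mean of that projection:

    ```python
    import numpy as np

    # Data with a deliberately non-zero mean
    rng = np.random.default_rng(1)
    train_x = rng.normal(size=(50, 4)) + 10.0

    # PCA via the covariance matrix (which always centers internally)
    Xc = train_x - train_x.mean(axis=0)
    eigen, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = V[:, np.argsort(eigen)[::-1]]

    # Scores, as pca() returns them (computed from centered data)
    U = Xc @ V

    # The uncentered projection differs from U by its own column means
    proj = train_x @ V
    print(np.allclose(U, proj - proj.mean(axis=0)))   # True
    print(np.allclose(U, proj))                       # False: mean not removed
    ```

    This works because projection is linear: mean(train_x * V) equals mean(train_x) * V, so subtracting the mean of the projection is the same as projecting the centered data.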

    It's also worth noting that even if you remove the mean before processing, there may still be a small difference, but it will be on the order of floating-point precision error:

    ((train_x * V) - U) ./ U ~~ 1.0e-15
    

    And this error can be safely ignored.