Tags: machine-learning, pca

Principal component analysis - remove features or not?


When I applied PCA to my dataset, PC1 accounted for only 25% of the variation, and PC2 for about 22%.

When I'm applying random forests or any other machine learning model, should I still drop some mildly correlated variables on the basis of the PCA output? Or should that only be done when PC1 and PC2 explain about 80% of the variation in the dataset?


Solution

  • I'm not sure I understood your question exactly. In any case, I assume you want to use PCA to improve the performance of your model. If so, you should try different values and keep the number of components that maximizes your chosen metric on the validation set (ideally through cross-validation), pretty much independently of the numerical value you get for the explained variance. The explained variance can give you a good hint about what the right number might be, but for supervised learning it makes much more sense to simply try several values and choose according to your specific dataset (see the sketch after this answer).

    In case you meant whether you could discard original features based on the outcome of PCA, the answer is definitely no. The principal components "live" in a different space from the original features, and there is no straightforward way to tell which original features contribute to each component. If you want to get something out of PCA, you must perform the subsequent training/prediction on the components and forget about the original features.
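
    To make the tuning concrete, here is a minimal sketch of the cross-validation approach described above, assuming a scikit-learn workflow. The `make_classification` toy data and the candidate `n_components` values are placeholders standing in for the asker's actual dataset and search grid:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Toy data standing in for the asker's dataset (hypothetical).
    X, y = make_classification(n_samples=500, n_features=30, random_state=0)

    # Chain scaling -> PCA -> random forest so every CV fold fits PCA
    # only on its own training split (no leakage).
    pipe = Pipeline([
        ("scale", StandardScaler()),  # PCA is scale-sensitive
        ("pca", PCA()),
        ("rf", RandomForestClassifier(random_state=0)),
    ])

    # Treat the number of components as a hyperparameter and keep the
    # value that maximizes the cross-validated metric, regardless of
    # how much variance PC1/PC2 happen to explain.
    grid = GridSearchCV(
        pipe,
        param_grid={"pca__n_components": [2, 5, 10, 20, 30]},
        cv=5,
        scoring="accuracy",
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)

    # Prediction runs on the components, never on hand-picked original
    # features: the pipeline transforms new data automatically.
    preds = grid.predict(X[:5])
    ```

    Wrapping PCA and the forest in a single `Pipeline` also means prediction on new data happens in component space automatically, which is exactly the "train and predict on components" discipline the answer recommends.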