python machine-learning scikit-learn pca

Explanation of the percentage value in scikit-learn PCA method

In scikit-learn, there is a class called PCA. Its first parameter, n_components, can be given as a fraction between 0 and 1. This site explains this parameter as follows:

Notice the code below has .95 for the number of components parameter. It means that scikit-learn chooses the minimum number of principal components such that 95% of the variance is retained.

> from sklearn.decomposition import PCA
> # Make an instance of the Model 
> pca = PCA(.95)

I'm a bit in the dark about the interpretation of this explanation. Let's say the output of the PCA would be as follows:

PC1 explains 70 % of the complete variance
PC2 explains 15 % of the complete variance
PC3 explains 10 % of the complete variance
PC4 explains 4 % of the complete variance
PC5 explains 1 % of the complete variance

Would the statement PCA(0.71) return PC1 and PC5 (as together they explain exactly 71 % of the variance), or would it return PC1 and PC2? And what if I wanted to retain only 0.5 % of the variance, i.e. which PC would the statement PCA(0.005) return?

Solution

You touch on a more general point which, although relied upon all the time in practice, is seldom mentioned explicitly, not even in tutorials and introductory expositions. Although such a question had never occurred to me, it makes perfect sense from a beginner's point of view (beginners are usually free of the conventions that more experienced practitioners take for granted, often without even noticing them).

Normally, when we select the number of principal components (e.g. for dimensionality reduction, visualization, etc.), we select a number k, and what is implicitly meant is "start from PC1 and continue sequentially, up to and including PCk". This is the principle behind, say, the preProcess function of the caret package in R (and arguably behind all functions performing similar tasks, in whatever software package).
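The sequential convention can be sketched in a few lines of plain Python, using the hypothetical variance shares from the question. The helper name n_components_for is made up for illustration, and the strict "greater than" comparison reflects my understanding of how scikit-learn treats a fractional n_components; treat it as a sketch of the rule, not as the library's own code.

```python
from itertools import accumulate


def n_components_for(threshold, ratios):
    """Smallest k such that PC1..PCk together explain more than `threshold`.

    `ratios` must already be sorted from largest to smallest, which is how
    principal components are ordered. (Hypothetical helper, for illustration.)
    """
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative > threshold:
            return k
    return len(ratios)


# Variance shares of PC1..PC5, taken from the question.
ratios = [0.70, 0.15, 0.10, 0.04, 0.01]

print(n_components_for(0.71, ratios))   # 2 -> PC1 and PC2 (70 % alone is not enough)
print(n_components_for(0.005, ratios))  # 1 -> PC1 alone already exceeds 0.5 %
```

Note that the loop only ever walks forward from PC1; there is no step at which a later component such as PC5 could be picked while skipping PC2.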

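In other words, and to the best of my knowledge at least, in cases such as the one you describe we never choose PCs by cherry-picking (i.e. taking PC2, PC4, and PC5, for example). Instead, we always choose a k < n (here n=5), and then we take all of the first k PCs, i.e. starting from PC1 onward. Applied to your example, PCA(0.71) would keep PC1 and PC2 (together 85 % of the variance, since PC1 alone falls short of 71 %), and PCA(0.005) would keep PC1 alone.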
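For a check against scikit-learn itself, here is a small experiment on synthetic data. The data are an assumption made purely for illustration: five independent Gaussian columns whose variances are proportional to 70, 15, 10, 4 and 1, so that the explained variance ratios come out close to the figures in the question. The attributes n_components_ and explained_variance_ratio_ are the ones a fitted PCA object exposes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): independent columns with
# variances proportional to 70, 15, 10, 4 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 5)) * np.sqrt([70, 15, 10, 4, 1])

# A float between 0 and 1 asks for a share of variance, not a component count.
pca_71 = PCA(n_components=0.71).fit(X)
pca_tiny = PCA(n_components=0.005).fit(X)

print(pca_71.n_components_)              # 2: PC1 and PC2 are kept
print(pca_71.explained_variance_ratio_)  # ratios of the kept components only
print(pca_tiny.n_components_)            # 1: PC1 alone is kept
```

In both cases the components kept are the leading ones; the fraction only decides where the sequence stops.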