Tags: python, scikit-learn, pca, variance

How to interpret Principal Component numbers to determine % of variation (Python)?


I'm trying to determine how many principal components are needed to explain more than 90% of the variance. I have the following:

from sklearn.decomposition import PCA
pca = PCA(n_components=11)
pca.fit_transform(X)

print(pca.explained_variance_, '\n\n') ##Line A

print(pca.explained_variance_ratio_) ##Line B

This outputs:

[1.79594388e+04 6.33546080e+02 4.45515520e+02 1.75087416e+02
 9.27041405e+01 4.09510643e+01 1.58667003e+01 6.04190503e+00
 3.33657900e+00 4.48917873e-01 1.06491531e-32] 


[9.27037479e-01 3.27026344e-02 2.29967979e-02 9.03773211e-03
 4.78523932e-03 2.11382838e-03 8.19013667e-04 3.11873465e-04
 1.72228866e-04 2.31724219e-05 5.49692234e-37]

I'm not sure whether to use Line A or Line B to determine the number of principal components that explain more than 90% of the variance. How do I interpret these numbers?


Solution

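  • Use Line B. According to the scikit-learn documentation, explained_variance_ratio_ gives the fraction of the total variance explained by each component, whereas explained_variance_ (Line A) gives the absolute amount of variance explained by each component. When all components are kept, the ratios sum to 1.0. Here the first component alone explains 92.7 percent of the variance, so a single component already exceeds the 90% threshold; the first two together explain almost 96 percent.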

    line_b = [9.27037479e-01, 3.27026344e-02, 2.29967979e-02, 9.03773211e-03,
    4.78523932e-03, 2.11382838e-03, 8.19013667e-04, 3.11873465e-04,
    1.72228866e-04, 2.31724219e-05, 5.49692234e-37]
    
    print(f"Percentage first component = {line_b[0]*100}")
    print(f"Percentage first and second component = {sum(line_b[0:2])*100}")
    

    output:

    Percentage first component = 92.7037479
    Percentage first and second component = 95.97401134
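
    To answer the "how many components" question for any threshold, one option is to accumulate the ratios from Line B and find the first point where the running total exceeds the target. The sketch below is only an illustration: it uses the values printed in the question, and the helper name components_needed is hypothetical, not part of scikit-learn. It also shows that dividing Line A by its sum reproduces Line B, which holds here because all components were kept.

```python
from itertools import accumulate

# Values copied from the question's output (Line A and Line B).
line_a = [1.79594388e+04, 6.33546080e+02, 4.45515520e+02, 1.75087416e+02,
          9.27041405e+01, 4.09510643e+01, 1.58667003e+01, 6.04190503e+00,
          3.33657900e+00, 4.48917873e-01, 1.06491531e-32]
line_b = [9.27037479e-01, 3.27026344e-02, 2.29967979e-02, 9.03773211e-03,
          4.78523932e-03, 2.11382838e-03, 8.19013667e-04, 3.11873465e-04,
          1.72228866e-04, 2.31724219e-05, 5.49692234e-37]

# Line A normalised by its total gives (approximately) Line B,
# because every component was retained in this fit.
total_variance = sum(line_a)
ratios_from_a = [v / total_variance for v in line_a]


def components_needed(ratios, threshold):
    """Hypothetical helper: smallest number of leading components whose
    cumulative ratio exceeds `threshold`, or None if it is never exceeded."""
    cumulative = accumulate(ratios)  # running total of explained variance
    return next(
        (k for k, total in enumerate(cumulative, start=1) if total > threshold),
        None,
    )


for threshold in (0.90, 0.95, 0.99):
    k = components_needed(line_b, threshold)
    print(f"More than {threshold:.0%} of the variance: {k} component(s)")
```

    With a fitted model, the same helper could be called on pca.explained_variance_ratio_ instead of the hard-coded list, and numpy.cumsum would give the same running totals. scikit-learn's documentation also describes passing a float between 0 and 1 as n_components (with svd_solver='full') so that PCA itself selects enough components to exceed that fraction of variance; whether that fits here depends on the installed version and the data.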