
While using R, PCA and Plotting Cumulative Variance


I am working in R with a scaled dataset and principal component analysis (princomp). Everything works fine, but I would like to plot the cumulative percentage of variance explained by the principal components. The summary prints this information, but I have not been able to access it. In other words, I want to plot y = 'Cumulative Proportion' from the PCA against x = 'component #'.

pca <- princomp(class5_subset_scaled)
summary(pca)  # the summary prints:

Importance of components:
                          Comp.1     Comp.2 ...
Standard deviation     0.0513980 0.04482971 ...
Proportion of Variance 0.2089728 0.15897513 ...
Cumulative Proportion  0.2089728 0.36794789 ...

However when I look at the names I am puzzled...

names(pca)
[1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call" 

Can I plot y='Cumulative Proportion' from pca vs. x='component#'?


Solution

  • You do not provide any data, so I will illustrate with the built-in iris data set. The summary shows what you want to get.

    iPCA = princomp(iris[,1:4])
    
    summary(iPCA)
    Importance of components:
                              Comp.1     Comp.2     Comp.3      Comp.4
    Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
    Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
    Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000
    

    As you noticed, the object returned by princomp has an element called sdev, which holds the "Standard deviation" row of the summary.

    iPCA$sdev
       Comp.1    Comp.2    Comp.3    Comp.4 
    2.0494032 0.4909714 0.2787259 0.1538707
    

    The variance is the square of the standard deviation.

    iPCA$sdev^2
        Comp.1     Comp.2     Comp.3     Comp.4 
    4.20005343 0.24105294 0.07768810 0.02367619
    

    The proportion of variance is the variance divided by the sum of all variances.

    iPCA$sdev^2 / sum(iPCA$sdev^2)
         Comp.1      Comp.2      Comp.3      Comp.4 
    0.924618723 0.053066483 0.017102610 0.005212184 
    

    And the Cumulative Proportion is the cumulative sum of the proportion of variance.

    cumsum(iPCA$sdev^2 / sum(iPCA$sdev^2))
       Comp.1    Comp.2    Comp.3    Comp.4 
    0.9246187 0.9776852 0.9947878 1.0000000
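
    As a cross-check, and assuming you are free to use prcomp instead of princomp (the question does not use it), the summary of a prcomp fit stores this table in an element named importance, so the row can be read directly. prcomp divides by n - 1 rather than n, so its standard deviations differ slightly from princomp's, but the proportions agree. A minimal sketch; the variable names are only illustrative:

    ```r
    # Manual cumulative proportion from the princomp fit
    iPCA <- princomp(iris[, 1:4])
    cumprop_manual <- cumsum(iPCA$sdev^2 / sum(iPCA$sdev^2))

    # Same quantity read from the importance matrix of summary(prcomp(...));
    # that matrix stores the proportions rounded to 5 decimal places
    iPCA2 <- prcomp(iris[, 1:4])
    cumprop_prcomp <- summary(iPCA2)$importance["Cumulative Proportion", ]

    # Agreement up to that rounding
    all.equal(unname(cumprop_manual), unname(cumprop_prcomp), tolerance = 1e-4)
    ```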
    

    Now that you have the Cumulative Proportion values, just plot them.

    plot(cumsum(iPCA$sdev^2 / sum(iPCA$sdev^2)), type="b")
    

    (Figure: cumulative proportion plot.)

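    Also, notice the scale on the plot: by default the y-axis covers only the range of the data (here roughly 0.92 to 1). Depending on what you plan to do with the plot, you might prefer a y-axis that runs from 0 to 1: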

    plot(cumsum(iPCA$sdev^2 / sum(iPCA$sdev^2)), type="b", ylim=0:1)
    

    (Figure: cumulative proportion plot with the y-axis from 0 to 1.)
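
    The question asked for y = 'Cumulative Proportion' against x = 'component #'. A minimal sketch with labelled axes, using the same iris fit; the variable name cum_prop is introduced here only for illustration:

    ```r
    iPCA <- princomp(iris[, 1:4])

    # Cumulative proportion of variance, one value per component
    cum_prop <- cumsum(iPCA$sdev^2) / sum(iPCA$sdev^2)

    # Plot against component number with labelled axes; xaxt = "n" suppresses
    # the default x-axis so one tick per component can be drawn explicitly
    plot(seq_along(cum_prop), cum_prop, type = "b", ylim = c(0, 1),
         xlab = "Component", ylab = "Cumulative Proportion", xaxt = "n")
    axis(1, at = seq_along(cum_prop), labels = names(cum_prop))
    ```

    Storing the values in a variable first also makes them easy to reuse, for example to find how many components are needed to pass a chosen threshold.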