r, pca, r-caret

How to obtain principal component % variance explained in R? prcomp() and preProcess() comparison


I know that PCA can be conducted with the prcomp() function in base R, or with the preProcess() function in the caret package, amongst others.

Firstly, am I right in saying that if we just use the default settings for operations of type prcomp(<SOME_MATRIX>) or preProcess(<SOME_MATRIX>, method = "pca"), then the only difference in our results is that prcomp() does not centre and scale the data before conducting PCA, and preProcess() does? Therefore, do prcomp(scale(<SOME_MATRIX>)) and preProcess(<SOME_MATRIX>, method = "pca") output the same thing?

Secondly, and more importantly, how can we obtain the % variance explained by each PC from the output of either prcomp() or preProcess()? From both of these outputs I can see things like the means, standard deviations or rotations, but I think these refer just to the 'old' variables. Where is the information about the 'new' PCs and how much variance they account for?

This might come in useful if, for example, I am using preProcess(<SOME_MATRIX>, method = "pca", thresh = 0.8) and this returns 6 PCs, but I find that the first 5 PCs explain a total of 79.5% of the variance. Then I might be inclined not to include all 6 PCs.


Solution

  • Since your first question has already been answered, here is the answer to your second question for prcomp(). The % variance explained by each PC is available by calling summary() on the fitted object:

    df <- iris[1:4]
    pca_res <- prcomp(df, scale. = TRUE)
    summ <- summary(pca_res)
    summ
    
    #Importance of components:
    #                          PC1    PC2     PC3     PC4
    #Standard deviation     1.7084 0.9560 0.38309 0.14393
    #Proportion of Variance 0.7296 0.2285 0.03669 0.00518
    #Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
    
    summ$importance[2,]
    # PC1     PC2     PC3     PC4 
    #0.72962 0.22851 0.03669 0.00518
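
The same proportions can also be computed by hand from the standard deviations that prcomp() stores in `$sdev`, which may make it clearer where the numbers come from. The following is a minimal sketch rather than a definitive recipe; the `thresh` value of 0.8 is only an illustrative assumption borrowed from the question, and `n_comp` is a hypothetical helper name.

```r
# Same data and model as in the answer above
df <- iris[1:4]
pca_res <- prcomp(df, scale. = TRUE)

# Squared standard deviations are the variances of the individual PCs
pc_var <- pca_res$sdev^2

# Proportion and cumulative proportion of variance explained
prop_var <- pc_var / sum(pc_var)
cum_var  <- cumsum(prop_var)
names(prop_var) <- names(cum_var) <- colnames(pca_res$x)

# Percentages per component
round(100 * prop_var, 2)

# Smallest number of PCs whose cumulative proportion reaches a threshold
thresh <- 0.8  # assumed threshold, for illustration only
n_comp <- which(cum_var >= thresh)[1]
n_comp
```

The values in `prop_var` should agree with the "Proportion of Variance" row printed by summary(), up to rounding.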
    

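    As far as I know, this information is not available when using the caret package (see issue discussed here). The stored preProcess object keeps the rotation, thresh and numComp, but not the standard deviations of the components: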

    library(caret)

    # Fit a k-NN model with centring, scaling and PCA as pre-processing steps
    mod <- train(Species ~ ., data = iris, method = "knn",
                 preProc = c("center", "scale", "pca"))
    str(mod$preProcess)
    
    
    List of 22
     $ dim              : int [1:2] 150 4
     $ bc               : NULL
     $ yj               : NULL
     $ et               : NULL
     $ invHyperbolicSine: NULL
     $ mean             : Named num [1:4] 5.84 3.06 3.76 1.2
      ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
     $ std              : Named num [1:4] 0.828 0.436 1.765 0.762
      ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
     $ ranges           : NULL
     $ rotation         : num [1:4, 1:2] 0.521 -0.269 0.58 0.565 -0.377 ...
      ..- attr(*, "dimnames")=List of 2
      .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
      .. ..$ : chr [1:2] "PC1" "PC2"
     $ method           :List of 4
      ..$ center: chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
      ..$ scale : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
      ..$ pca   : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
      ..$ ignore: chr(0) 
     $ thresh           : num 0.95
     $ pcaComp          : NULL
     $ numComp          : num 2
     $ ica              : NULL
     $ wildcards        :List of 2
      ..$ PCA: chr(0) 
      ..$ ICA: chr(0) 
     $ k                : num 5
     $ knnSummary       :function (x, ...)  
     $ bagImp           : NULL
     $ median           : NULL
     $ data             : NULL
     $ rangeBounds      : num [1:2] 0 1
     $ call             : chr "scrubed"
     - attr(*, "class")= chr "preProcess"
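
A possible workaround, sketched here under the assumption that the PCA step runs on centred and scaled data (as in the call above): apply the preProcess object to the data with predict() and compare the variance of each retained score column with the total variance. After scaling, every original column has variance 1, so the total variance equals the number of columns. This uses preProcess() directly rather than through train(), and the names `pp`, `scores` and `total_var` are only illustrative.

```r
library(caret)

df <- iris[1:4]
pp <- preProcess(df, method = c("center", "scale", "pca"), thresh = 0.95)

# Scores of the retained PCs only (numComp columns)
scores <- predict(pp, df)

# Each scaled input column has variance 1, so the total variance
# equals the number of columns that went into the PCA
total_var <- ncol(df)

# Proportion of variance explained by each retained PC
prop_var <- apply(as.matrix(scores), 2, var) / total_var
prop_var
cumsum(prop_var)
```

This only covers the components that preProcess() kept. To see the full breakdown, including the discarded components, running prcomp(df, scale. = TRUE) alongside is probably the simpler option.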