Search code examples
rpca

PCA from FactoMineR::PCA() not returning same as base::prcomp()


Doing some PCA analysis and when comparing with the results for FactoMineR function PCA together with prcomp from base I dont get the same results. One example

library(ISLR)
library(FactoMineR)
data("NCI60")

df <- NCI60$data


pca_prcomp <- prcomp(df, scale. = T)
pca_facto <- FactoMineR::PCA(df, scale.unit = T, graph = F, ncp = 65)


# One column is missing

dim(pca_prcomp$x)
dim(pca_facto$ind$coord) 

# Values are similiare - but not the same

head(pca_prcomp$x[, 1:2])
head(pca_facto$ind$coord[, 1:2])


# Using scale function - does not return same values

pca_facto_scale <- PCA(scale(df), scale.unit = F, graph = F, ncp = 65)

head(pca_facto$ind$coord[, 1:2], 3)
head(pca_facto_scale$ind$coord[, 1:2], 3)

Solution

  • Sorry for being late, the FactoMineR package uses the same approach of svd() which should be similar (but not identical) with the prcomp() approach and both of them are listed under the Q-mode, which is the preferred method to do PCA for its numerical accuracy. But note, I didn't say identical, why? FactoMineR uses its own algorithm for PCA where it calculates the number of components like the following:

    ncp <- min(ncp, nrow(X) - 1, ncol(X))
    

    which tells you clearly why you got number of components 63 not 64 as what prcomp() would normally give. Your data set is typical of genomics data where you have n rows smaller than p columns of genes and the above code will clearly take columns or rows, whichever has the less number. If you follow the svd() algorithm it will return 64 dimensions not 63.

    To explore the source code further type FactoMineR:::PCA.

    For differences between the Q-mode (svd, prcomp(), FactoMineR::PCA()) and R-mode (eigen(), princomp()) I would recommend visiting this answer.

    Side note: for prcomp() you want pass the center = T argument in order to center your data before doing PCA. Scaling on the other hand will give all your gene columns equal weight.

    pca_prcomp <- prcomp(df, center = T, scale. = T) # add center=T
    

    For scaling, the prcomp() use N as the divisor while FactoMineR::PCA() uses N-1 instead. The code below will prove it (refer to the same linked answer above):

    # this is the scaled data by scale()
    df_scaled <- scale(df)
    
    # then you need to get the standardized data matrix from the output of the FactoMineR::PCR() function, which can be done easily as follows:
    df_restored <- pca_facto$svd$U %*% diag(pca_facto$svd$vs) %*% t(pca_facto$svd$V)
    
    # the to make both FactoMineR::PCR() and scale() match up you need to do the correction
    df_corrected <- df_restored * sqrt(63 / 64) # correct for sqrt(N-1/N)
    
    head(df[, 1:5]) # glimpse the first five columns only!
    head(df_scaled[, 1:5])
    head(df_restored[, 1:5]) # glimpse the first five columns only!
    head(df_corrected[, 1:5])
    round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3) # TRUE
    
    R> head(df[, 1:5])
           1      2      3      4      5
    V1 0.300  1.180  0.550  1.140 -0.265
    V2 0.680  1.290  0.170  0.380  0.465
    V3 0.940 -0.040 -0.170 -0.040 -0.605
    V4 0.280 -0.310  0.680 -0.810  0.625
    V5 0.485 -0.465  0.395  0.905  0.200
    V6 0.310 -0.030 -0.100 -0.460 -0.205
    R> head(df_scaled[, 1:5])
           1        2      3      4      5
    V1 0.723  1.59461  1.315  1.345 -0.600
    V2 1.584  1.73979  0.438  0.649  0.905
    V3 2.173 -0.01609 -0.346  0.264 -1.301
    V4 0.678 -0.37256  1.615 -0.441  1.235
    V5 1.142 -0.57720  0.958  1.130  0.359
    V6 0.746 -0.00289 -0.185 -0.120 -0.476
    R> head(df_restored[, 1:5])
          [,1]     [,2]   [,3]   [,4]   [,5]
    [1,] 0.729  1.60722  1.326  1.356 -0.605
    [2,] 1.596  1.75354  0.442  0.654  0.912
    [3,] 2.190 -0.01622 -0.349  0.266 -1.311
    [4,] 0.683 -0.37550  1.628 -0.444  1.244
    [5,] 1.151 -0.58176  0.965  1.139  0.361
    [6,] 0.752 -0.00291 -0.186 -0.121 -0.480
    R> head(df_corrected[, 1:5])
          [,1]     [,2]   [,3]   [,4]   [,5]
    [1,] 0.723  1.59461  1.315  1.345 -0.600
    [2,] 1.584  1.73979  0.438  0.649  0.905
    [3,] 2.173 -0.01609 -0.346  0.264 -1.301
    [4,] 0.678 -0.37256  1.615 -0.441  1.235
    [5,] 1.142 -0.57720  0.958  1.130  0.359
    [6,] 0.746 -0.00289 -0.185 -0.120 -0.476
    R> round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3)
          1    2    3    4    5
    V1 TRUE TRUE TRUE TRUE TRUE
    V2 TRUE TRUE TRUE TRUE TRUE
    V3 TRUE TRUE TRUE TRUE TRUE
    V4 TRUE TRUE TRUE TRUE TRUE
    V5 TRUE TRUE TRUE TRUE TRUE
    V6 TRUE TRUE TRUE TRUE TRUE
    

    Book excerpt

    There is also the book for FactoMineR package called "Exploratory Multivariate Analysis by Example Using R" 2nd edition by François Husson, Sébastien, and Lê Jérôme Pagès. Below is an excerpt from page 55 of the book which was discussing a data set from a genomic study similar to yours with n rows (43) far less than p 7407 columns chicken.csv data set, you can see more info in their website as well as the data set itself can be downloaded from this link.

    enter image description here