Doing some PCA analysis and when comparing with the results for FactoMineR
function PCA
together with prcomp
from base
I dont get the same results. One example
library(ISLR)
library(FactoMineR)
data("NCI60")
df <- NCI60$data
pca_prcomp <- prcomp(df, scale. = T)
pca_facto <- FactoMineR::PCA(df, scale.unit = T, graph = F, ncp = 65)
# One column is missing
dim(pca_prcomp$x)
dim(pca_facto$ind$coord)
# Values are similiare - but not the same
head(pca_prcomp$x[, 1:2])
head(pca_facto$ind$coord[, 1:2])
# Using scale function - does not return same values
pca_facto_scale <- PCA(scale(df), scale.unit = F, graph = F, ncp = 65)
head(pca_facto$ind$coord[, 1:2], 3)
head(pca_facto_scale$ind$coord[, 1:2], 3)
Sorry for being late, the FactoMineR
package uses the same approach of svd()
which should be similar (but not identical) with the prcomp()
approach and both of them are listed under the Q-mode, which is the preferred method to do PCA for its numerical accuracy. But note, I didn't say identical, why? FactoMineR
uses its own algorithm for PCA where it calculates the number of components like the following:
ncp <- min(ncp, nrow(X) - 1, ncol(X))
which tells you clearly why you got number of components 63 not 64 as what prcomp()
would normally give. Your data set is typical of genomics data where you have n
rows smaller than p
columns of genes and the above code will clearly take columns or rows, whichever has the less number. If you follow the svd()
algorithm it will return 64 dimensions not 63.
To explore the source code further type FactoMineR:::PCA
.
For differences between the Q-mode (svd
, prcomp()
, FactoMineR::PCA()
) and R-mode (eigen()
, princomp()
) I would recommend visiting this answer.
Side note: for prcomp()
you want pass the center = T
argument in order to center your data before doing PCA. Scaling on the other hand will give all your gene columns equal weight.
pca_prcomp <- prcomp(df, center = T, scale. = T) # add center=T
For scaling, the prcomp()
use N
as the divisor while FactoMineR::PCA()
uses N-1
instead. The code below will prove it (refer to the same linked answer above):
# this is the scaled data by scale()
df_scaled <- scale(df)
# then you need to get the standardized data matrix from the output of the FactoMineR::PCR() function, which can be done easily as follows:
df_restored <- pca_facto$svd$U %*% diag(pca_facto$svd$vs) %*% t(pca_facto$svd$V)
# the to make both FactoMineR::PCR() and scale() match up you need to do the correction
df_corrected <- df_restored * sqrt(63 / 64) # correct for sqrt(N-1/N)
head(df[, 1:5]) # glimpse the first five columns only!
head(df_scaled[, 1:5])
head(df_restored[, 1:5]) # glimpse the first five columns only!
head(df_corrected[, 1:5])
round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3) # TRUE
R> head(df[, 1:5])
1 2 3 4 5
V1 0.300 1.180 0.550 1.140 -0.265
V2 0.680 1.290 0.170 0.380 0.465
V3 0.940 -0.040 -0.170 -0.040 -0.605
V4 0.280 -0.310 0.680 -0.810 0.625
V5 0.485 -0.465 0.395 0.905 0.200
V6 0.310 -0.030 -0.100 -0.460 -0.205
R> head(df_scaled[, 1:5])
1 2 3 4 5
V1 0.723 1.59461 1.315 1.345 -0.600
V2 1.584 1.73979 0.438 0.649 0.905
V3 2.173 -0.01609 -0.346 0.264 -1.301
V4 0.678 -0.37256 1.615 -0.441 1.235
V5 1.142 -0.57720 0.958 1.130 0.359
V6 0.746 -0.00289 -0.185 -0.120 -0.476
R> head(df_restored[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.729 1.60722 1.326 1.356 -0.605
[2,] 1.596 1.75354 0.442 0.654 0.912
[3,] 2.190 -0.01622 -0.349 0.266 -1.311
[4,] 0.683 -0.37550 1.628 -0.444 1.244
[5,] 1.151 -0.58176 0.965 1.139 0.361
[6,] 0.752 -0.00291 -0.186 -0.121 -0.480
R> head(df_corrected[, 1:5])
[,1] [,2] [,3] [,4] [,5]
[1,] 0.723 1.59461 1.315 1.345 -0.600
[2,] 1.584 1.73979 0.438 0.649 0.905
[3,] 2.173 -0.01609 -0.346 0.264 -1.301
[4,] 0.678 -0.37256 1.615 -0.441 1.235
[5,] 1.142 -0.57720 0.958 1.130 0.359
[6,] 0.746 -0.00289 -0.185 -0.120 -0.476
R> round(head(df_scaled[, 1:5]), 3) == round(head(df_corrected[, 1:5]), 3)
1 2 3 4 5
V1 TRUE TRUE TRUE TRUE TRUE
V2 TRUE TRUE TRUE TRUE TRUE
V3 TRUE TRUE TRUE TRUE TRUE
V4 TRUE TRUE TRUE TRUE TRUE
V5 TRUE TRUE TRUE TRUE TRUE
V6 TRUE TRUE TRUE TRUE TRUE
Book excerpt
There is also the book for FactoMineR
package called "Exploratory Multivariate Analysis by Example Using R" 2nd edition by François Husson, Sébastien, and Lê Jérôme Pagès. Below is an excerpt from page 55 of the book which was discussing a data set from a genomic study similar to yours with n
rows (43) far less than p
7407 columns chicken.csv
data set, you can see more info in their website as well as the data set itself can be downloaded from this link.