I am trying to use the FactoMineR
package for implementing PCA and MCA on my datasets.
I have a dataset and, after a little initial cleanup, I applied the PCA()
function on it. I tried understanding the summary of the results.
library(reshape)
library(gridExtra)
library(gdata)
library(ggplot2)
library(ggbiplot)
library(FactoMineR)
x <- read.csv('cars.csv',stringsAsFactors = FALSE)
y <- na.omit(x)
y <- y[,c(-8,-9)]
s <- y[,-1]
rownames(s) <- make.names(y[,1], unique = TRUE)
res.pca <- PCA(s, quanti.sup = NULL, quali.sup=NULL,scale.unit = TRUE,ncp=2)
summary(res.pca)
This is what summary(res.pca) prints out in my console:
Call:
PCA(X = s, scale.unit = TRUE, ncp = 2, quanti.sup = NULL, quali.sup = NULL)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
Variance 4.788 0.729 0.258 0.125 0.063 0.036
% of var. 79.804 12.144 4.308 2.086 1.053 0.605
Cumulative % of var. 79.804 91.948 96.256 98.342 99.395 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2
chevrolet.chevelle.malibu | 2.516 | 2.326 0.288 0.855 | -0.572 0.115 0.052 |
buick.skylark.320 | 3.307 | 3.206 0.548 0.940 | -0.683 0.163 0.043 |
plymouth.satellite | 2.915 | 2.670 0.380 0.839 | -0.994 0.346 0.116 |
amc.rebel.sst | 2.749 | 2.605 0.362 0.898 | -0.623 0.136 0.051 |
ford.torino | 2.908 | 2.600 0.360 0.799 | -1.094 0.419 0.141 |
ford.galaxie.500 | 4.578 | 4.401 1.032 0.924 | -1.011 0.358 0.049 |
chevrolet.impala | 5.210 | 4.920 1.289 0.892 | -1.368 0.655 0.069 |
plymouth.fury.iii | 5.144 | 4.836 1.246 0.884 | -1.537 0.827 0.089 |
pontiac.catalina | 5.165 | 4.910 1.285 0.904 | -1.041 0.379 0.041 |
amc.ambassador.dpl | 4.406 | 4.056 0.876 0.847 | -1.668 0.974 0.143 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2
Cylinders | 0.942 18.543 0.888 | 0.127 2.200 0.016 |
Displacement | 0.971 19.672 0.942 | 0.093 1.177 0.009 |
Horsepower | 0.950 18.846 0.902 | -0.142 2.761 0.020 |
Weight | 0.941 18.499 0.886 | 0.244 8.185 0.060 |
MPG | -0.873 15.918 0.762 | -0.209 5.994 0.044 |
Acceleration | -0.639 8.522 0.408 | 0.762 79.683 0.581 |
While I understood most of this summary, I am not sure what Dist, ctr and Dim.X mean for the individual data points, i.e. this part:
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2
chevrolet.chevelle.malibu | 2.516 | 2.326 0.288 0.855 | -0.572 0.115 0.052 |
buick.skylark.320 | 3.307 | 3.206 0.548 0.940 | -0.683 0.163 0.043 |
plymouth.satellite | 2.915 | 2.670 0.380 0.839 | -0.994 0.346 0.116 |
amc.rebel.sst | 2.749 | 2.605 0.362 0.898 | -0.623 0.136 0.051 |
Let's look at the summary table on individuals based on a sample dataset from the package for illustration:
library(FactoMineR)
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13)
> summary(res.pca)
Call:
PCA(X = decathlon, ncp = 5, quanti.sup = 11:12, quali.sup = 13)
...
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
SEBRLE | 2.369 | 0.792 0.467 0.112 | 0.772 0.836 0.106 | 0.827 1.187
CLAY | 3.507 | 1.235 1.137 0.124 | 0.575 0.464 0.027 | 2.141 7.960
KARPOV | 3.396 | 1.358 1.375 0.160 | 0.484 0.329 0.020 | 1.956 6.644
...
Dist is each individual's distance from the origin in the scaled variable space, so it can be thought of as a summary measure of an individual's measurements across all relevant columns in the dataset. It is calculated as sqrt(rowSums(X^2)), where X is a scaled version of the input dataset s (after trimming away the supplementary variables).
If the default options in PCA are in place, i.e. scale.unit = TRUE, row.w = NULL, col.w = NULL, then X should be equivalent to scale(as.matrix(<trimmed down dataset>)) * sqrt(n/(n-1)), where n is the number of rows. Note the parentheses around n-1: without them, n/n-1 evaluates to 0. I have not checked this for non-default options, as I find the intuitive interpretation more important than the detailed calculations here.
# verify the calculated values against summary table's Dist values
> X <- scale(as.matrix(decathlon[,1:10])) * sqrt(nrow(decathlon)/(nrow(decathlon) - 1))
> sqrt(rowSums(X^2))
SEBRLE CLAY KARPOV BERNARD YURKOV WARNERS ZSIVOCZKY
2.368839 3.507004 3.396399 2.762607 3.017906 2.427873 2.563128
...
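The sqrt(n/(n-1)) factor is consistent with standardising by the population standard deviation (denominator n), whereas scale() uses the sample one (denominator n - 1). The following is a minimal base-R sketch of that equivalence. It uses six columns of the built-in mtcars data purely as a stand-in (an assumption for illustration, not the dataset from the question) and does not call FactoMineR at all.

```r
# Stand-in data: six numeric columns from the built-in mtcars dataset
M <- as.matrix(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])
n <- nrow(M)

# Route 1: scale() (sample sd, denominator n - 1), corrected by sqrt(n / (n - 1))
X1 <- scale(M) * sqrt(n / (n - 1))

# Route 2: centre each column, then divide by its population sd (denominator n)
centred <- sweep(M, 2, colMeans(M))
pop_sd  <- sqrt(colMeans(centred^2))
X2      <- sweep(centred, 2, pop_sd, "/")

# Both routes give the same distance from the origin for every row
dist1 <- sqrt(rowSums(X1^2))
dist2 <- sqrt(rowSums(X2^2))
all.equal(dist1, dist2)
```

A side effect of this scaling is that the mean of the squared distances equals the number of active columns, since each standardised column has a mean square of 1.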
Dim.X is each individual's coordinate on principal component X, i.e. the projection of its position (relative to the origin of the multidimensional space) onto that component. To visualise this, use plot(res.pca, choix = "ind") for the individual factor map, adjust the xlim / ylim / axes arguments to zoom in on any specific individual, and compare against the table values. Check ?plot.PCA for more arguments in the function.
# plot individual factor map in the first two principal components
> plot(res.pca, axes = c(1, 2), choix = "ind")
# zoom in to check SEBRLE, CLAY & KARPOV's coordinates
> plot(res.pca, axes = c(1, 2), choix = "ind", xlim = c(0, 2), ylim = c(0, 1))
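To make the "projection" idea concrete, here is a rough base-R sketch that computes such coordinates by hand with svd(). It again uses mtcars columns as a stand-in dataset (an assumption for illustration), with default-style scaling and uniform weights. The sign of any component is arbitrary, so values may be mirrored relative to what a PCA routine reports.

```r
# Stand-in data, standardised with the population sd as described earlier
M <- as.matrix(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])
n <- nrow(M)
X <- scale(M) * sqrt(n / (n - 1))

# The principal axes are the right singular vectors of X
sv <- svd(X)

# Coordinates: project every row of X onto every axis
coord <- X %*% sv$v
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# Eigenvalues: variance (denominator n) of the coordinates on each axis
eig <- sv$d^2 / n

head(coord[, 1:2])   # coordinates on the first two components
round(eig, 3)        # these sum to the number of columns
```

Up to sign and the sqrt(n/(n-1)) factor, these coordinates should match prcomp(M, scale. = TRUE)$x.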
ctr indicates each individual's contribution to a given principal component, as a percentage. You can get the full list of contributions from res.pca$ind$contrib. Each column sums to 100 (%).
# view each individual's contribution to each principal component
> head(res.pca$ind$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
SEBRLE 0.46715109 0.8359506 1.186888 3.1842186 1.7811617
CLAY 1.13695340 0.4635341 7.959744 0.2905893 13.8872052
KARPOV 1.37515734 0.3289363 6.643820 7.9543342 2.2523610
BERNARD 0.27693912 1.0740657 1.374952 11.3801552 0.4658144
YURKOV 0.25595504 6.3757577 2.605847 1.7611939 5.5775065
WARNERS 0.09494738 3.9862179 1.020117 0.8014610 3.5736432
# verify each principal component's contributions sum up to 100%.
> colSums(res.pca$ind$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
100 100 100 100 100
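One formula that reproduces the "each column sums to 100" property, assuming uniform row weights, is ctr = 100 * (Dim.X^2 / n) / eigenvalue. The sketch below checks it in base R on a stand-in dataset (mtcars columns, chosen only for illustration). I have not compared it line by line with PCA()'s source, so treat it as a plausible reconstruction rather than the package's exact code.

```r
# Stand-in data, standardised with the population sd (uniform weights assumed)
M <- as.matrix(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])
n <- nrow(M)
X <- scale(M) * sqrt(n / (n - 1))

# Coordinates of the individuals and eigenvalues of the components
sv    <- svd(X)
coord <- X %*% sv$v
eig   <- sv$d^2 / n

# Contribution of individual i to component k, as a percentage:
# its share of that component's variance
contrib <- 100 * sweep(coord^2 / n, 2, eig, "/")

colSums(contrib)   # every component's contributions add up to 100
```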
cos2 is the squared cosine of the angle between an individual's position and each principal component, calculated as (Dim.X/Dist)^2. The closer it is to 1 for a given principal component, the better that component captures the characteristics of that individual.
# verify the calculated values against summary table's cos2 values
> head((res.pca$ind$coord/res.pca$ind$dist)^2)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
SEBRLE 0.11167888 0.10610262 0.12183534 0.24588345 0.08911755
CLAY 0.12400941 0.02684265 0.37278712 0.01023775 0.31701007
KARPOV 0.15991886 0.02030911 0.33175306 0.29878849 0.05481905
BERNARD 0.04867778 0.10023262 0.10377289 0.64611132 0.01713585
YURKOV 0.03769960 0.49858212 0.16480554 0.08379015 0.17193305
WARNERS 0.02160805 0.48164324 0.09968563 0.05891525 0.17021193
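A useful sanity check is that an individual's cos2 values add up to 1 when every component is kept. In the decathlon output above only 5 of the 10 components are shown, so those rows sum to less than 1. Here is a small base-R sketch of the calculation on a stand-in dataset (mtcars columns, an assumption for illustration):

```r
# Stand-in data, standardised with the population sd
M <- as.matrix(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])
n <- nrow(M)
X <- scale(M) * sqrt(n / (n - 1))

# Coordinates on all components, and distance of each row from the origin
coord <- X %*% svd(X)$v
d_ind <- sqrt(rowSums(X^2))

# Squared cosine: share of an individual's squared distance captured by a component
# (the vector d_ind is recycled down the columns, so row i is divided by d_ind[i])
cos2 <- (coord / d_ind)^2

head(rowSums(cos2))   # all components kept, so each row sums to 1
```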
For variables, the interpretations of "Dim.X" / "ctr" / "cos2" are similar. The exact calculations are more complicated, especially if you specify non-uniform weights for rows / columns. You can check PCA's code for details there.
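For the default case (scale.unit = TRUE, uniform weights), the variable table in the question is consistent with a simple reading: Dim.X is the correlation between the variable and the component, cos2 is that correlation squared, and ctr is 100 * cos2 / eigenvalue. For Cylinders on Dim.1, 0.942^2 is about 0.888, and 100 * 0.888 / 4.788 is about 18.5, in line with the printed values. The base-R sketch below checks those relations on a stand-in dataset (mtcars columns, an assumption for illustration); I have not verified it for other settings.

```r
# Stand-in data, standardised with the population sd
M <- as.matrix(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")])
n <- nrow(M)
X <- scale(M) * sqrt(n / (n - 1))

# Coordinates of the individuals and eigenvalues of the components
sv    <- svd(X)
coord <- X %*% sv$v
eig   <- sv$d^2 / n

# Variable coordinates: correlation of each variable with each component
var_coord   <- cor(M, coord)
var_cos2    <- var_coord^2
var_contrib <- 100 * sweep(var_cos2, 2, eig, "/")

round(var_coord[, 1:2], 3)
colSums(var_contrib)   # each component's variable contributions add up to 100
```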