
What exactly are ctr, distance and dimensions in PCA summary in FactoMineR?


I am trying to use the FactoMineR package for implementing PCA and MCA on my datasets.

I have a dataset and, after a little initial cleanup, I applied the PCA() function to it. I then tried to understand the summary of the results.

library(reshape)
library(gridExtra)
library(gdata)
library(ggplot2)
library(ggbiplot)
library(FactoMineR)

x <- read.csv('cars.csv',stringsAsFactors = FALSE)
y <- na.omit(x)

y <- y[,c(-8,-9)]
s <- y[,-1]
rownames(s) <- make.names(y[,1], unique = TRUE)


res.pca <- PCA(s, quanti.sup = NULL, quali.sup=NULL,scale.unit = TRUE,ncp=2)
summary(res.pca)

This is what summary(res.pca) prints out in my console

Call:
PCA(X = s, scale.unit = TRUE, ncp = 2, quanti.sup = NULL, quali.sup = NULL) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               4.788   0.729   0.258   0.125   0.063   0.036
% of var.             79.804  12.144   4.308   2.086   1.053   0.605
Cumulative % of var.  79.804  91.948  96.256  98.342  99.395 100.000

Individuals (the 10 first)
                              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
chevrolet.chevelle.malibu |  2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         |  3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        |  2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             |  2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |
ford.torino               |  2.908 |  2.600  0.360  0.799 | -1.094  0.419  0.141 |
ford.galaxie.500          |  4.578 |  4.401  1.032  0.924 | -1.011  0.358  0.049 |
chevrolet.impala          |  5.210 |  4.920  1.289  0.892 | -1.368  0.655  0.069 |
plymouth.fury.iii         |  5.144 |  4.836  1.246  0.884 | -1.537  0.827  0.089 |
pontiac.catalina          |  5.165 |  4.910  1.285  0.904 | -1.041  0.379  0.041 |
amc.ambassador.dpl        |  4.406 |  4.056  0.876  0.847 | -1.668  0.974  0.143 |

Variables
                             Dim.1    ctr   cos2    Dim.2    ctr   cos2  
Cylinders                 |  0.942 18.543  0.888 |  0.127  2.200  0.016 |
Displacement              |  0.971 19.672  0.942 |  0.093  1.177  0.009 |
Horsepower                |  0.950 18.846  0.902 | -0.142  2.761  0.020 |
Weight                    |  0.941 18.499  0.886 |  0.244  8.185  0.060 |
MPG                       | -0.873 15.918  0.762 | -0.209  5.994  0.044 |
Acceleration              | -0.639  8.522  0.408 |  0.762 79.683  0.581 |

While I understood most of this summary, I am not sure what Dist, Dim.X and ctr mean for the individual data points, i.e.

 Individuals (the 10 first)
                              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
chevrolet.chevelle.malibu |  2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         |  3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        |  2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             |  2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |

Solution

  • Let's look at the summary table on individuals based on a sample dataset from the package for illustration:

    library(FactoMineR)
    data(decathlon)
    res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13)
    
    > summary(res.pca)
    Call:
    PCA(X = decathlon, ncp = 5, quanti.sup = 11:12, quali.sup = 13) 
    ...
    Individuals (the 10 first)
                    Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
    SEBRLE      |  2.369 |  0.792  0.467  0.112 |  0.772  0.836  0.106 |  0.827  1.187
    CLAY        |  3.507 |  1.235  1.137  0.124 |  0.575  0.464  0.027 |  2.141  7.960
    KARPOV      |  3.396 |  1.358  1.375  0.160 |  0.484  0.329  0.020 |  1.956  6.644
    ...
    

    Dist is the Euclidean distance of an individual from the origin (the centre of the scaled data), so it can be thought of as a summary measure of that individual's measurements across all relevant columns in the dataset. It is calculated as sqrt(rowSums(X^2)), where X is a scaled version of the input dataset (after trimming away the supplementary variables).

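    If the default options in PCA are in place, i.e. scale.unit = TRUE, row.w = NULL, col.w = NULL, X should be equivalent to scale(as.matrix(<trimmed down dataset>)) * sqrt(n/(n-1)), where n is the number of rows. The extra factor converts the sample standard deviation used by scale() (divisor n-1) into the population standard deviation (divisor n). I have not checked this for non-default options, as I find the intuitive interpretation more important than the detailed calculations here.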

    # verify the calculated values against summary table's Dist values
    > X <- scale(as.matrix(decathlon[,1:10])) * sqrt(nrow(decathlon)/(nrow(decathlon) - 1))
    > sqrt(rowSums(X^2))
         SEBRLE        CLAY      KARPOV     BERNARD      YURKOV     WARNERS   ZSIVOCZKY 
       2.368839    3.507004    3.396399    2.762607    3.017906    2.427873    2.563128 
    ...
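
As a rough cross-check of the "distance from the origin" reading, here is a base-R sketch that does not call FactoMineR at all. It assumes the default options discussed above (unit scaling, uniform weights), uses the built-in USArrests data purely as a stand-in, and the variable names are my own. It rebuilds the principal-component coordinates with an SVD and compares the length of each row before and after the rotation.

```r
# Base-R sketch; USArrests is only an illustrative built-in dataset.
dat <- as.matrix(USArrests)
n <- nrow(dat)

# Standardise with the population SD (divisor n), as described above.
X <- scale(dat) * sqrt(n / (n - 1))

# Coordinates of every individual on all principal components.
coord <- X %*% svd(X)$v

dist_from_data  <- sqrt(rowSums(X^2))      # length in the scaled variables
dist_from_coord <- sqrt(rowSums(coord^2))  # length across all components

# A rotation preserves lengths, so the two routes should agree.
print(all.equal(dist_from_data, dist_from_coord))
```

In other words, Dist depends on all components, not only on the ncp components shown in the summary.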
    

    Dim.X is the individual's coordinate on principal component X, i.e. the projection of its position (relative to the origin in multidimensional space) onto that component. To visualise this, use plot(res.pca, choix = "ind") for the individual factor map, adjust the xlim / ylim / axes arguments to zoom in on any specific individual, and compare against the table values. Check ?plot.PCA for more arguments in the function.

    # plot individual factor map in the first two principal components
    > plot(res.pca, axes = c(1, 2), choix = "ind")
    
    # zoom in to check Sebrle, Clay & Karpov's coordinates
    > plot(res.pca, axes = c(1, 2), choix = "ind", xlim = c(0, 2), ylim = c(0, 1))
    

    [Figure: individual factor map, zoomed in]
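
To make the projection idea concrete without a plot, the following base-R sketch computes one coordinate by hand. It is a reconstruction under the default options (unit scaling, uniform weights), not FactoMineR's own code, and it uses the built-in USArrests data and my own variable names for illustration. Signs of SVD axes are arbitrary, so a column may come out flipped relative to what PCA() reports.

```r
# Base-R sketch; USArrests is only an illustrative built-in dataset.
dat <- as.matrix(USArrests)
n <- nrow(dat)
X <- scale(dat) * sqrt(n / (n - 1))

# Columns of `axes` are unit-length principal axes.
axes <- svd(X)$v
coord <- X %*% axes
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# An individual's Dim.1 value is the dot product of its scaled row
# with the first principal axis.
by_hand <- sum(X["Alaska", ] * axes[, 1])
print(c(by_hand = by_hand, from_matrix = coord["Alaska", "Dim.1"]))

# The variance of each coordinate column (divisor n) is that
# component's eigenvalue; with unit scaling they total the number of variables.
eig <- colSums(coord^2) / n
print(round(eig, 3))
```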

    ctr indicates each individual's contribution to a given principal component, as a percentage. You can get the full list of contributions from res.pca$ind$contrib. Each column sums to 100 (%).

    # view each individual's contribution to each principal component
    > head(res.pca$ind$contrib)
                 Dim.1     Dim.2    Dim.3      Dim.4      Dim.5
    SEBRLE  0.46715109 0.8359506 1.186888  3.1842186  1.7811617
    CLAY    1.13695340 0.4635341 7.959744  0.2905893 13.8872052
    KARPOV  1.37515734 0.3289363 6.643820  7.9543342  2.2523610
    BERNARD 0.27693912 1.0740657 1.374952 11.3801552  0.4658144
    YURKOV  0.25595504 6.3757577 2.605847  1.7611939  5.5775065
    WARNERS 0.09494738 3.9862179 1.020117  0.8014610  3.5736432
    
    # verify each principal component's contributions sum up to 100%.
    > colSums(res.pca$ind$contrib)
    Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 
      100   100   100   100   100 
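
If you want to see where those percentages come from, here is a base-R sketch under the same assumptions as before (default options, uniform row weights; USArrests and the variable names are placeholders of mine, not anything from FactoMineR). With uniform weights, an individual's contribution to a component is its squared coordinate expressed as a percentage of the sum of squared coordinates on that component.

```r
# Base-R sketch; USArrests is only an illustrative built-in dataset.
dat <- as.matrix(USArrests)
n <- nrow(dat)
X <- scale(dat) * sqrt(n / (n - 1))

coord <- X %*% svd(X)$v
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# Squared coordinate as a percentage of the column total.
contrib <- 100 * sweep(coord^2, 2, colSums(coord^2), "/")

print(head(round(contrib, 3)))
print(colSums(contrib))  # each component's contributions should total 100
```

Individuals far from the origin along a component therefore get a large ctr on it, which is why the big cars in the question's table have the largest Dim.1 contributions.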
    

    cos2 is the squared cosine of the angle between an individual and a given principal component, calculated as (Dim.X/Dist)^2. The closer it is to 1 for a given principal component, the better that component is at capturing all the characteristics of that individual.

    # verify the calculated values against summary table's cos2 values
    > head((res.pca$ind$coord/res.pca$ind$dist)^2)
                 Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
    SEBRLE  0.11167888 0.10610262 0.12183534 0.24588345 0.08911755
    CLAY    0.12400941 0.02684265 0.37278712 0.01023775 0.31701007
    KARPOV  0.15991886 0.02030911 0.33175306 0.29878849 0.05481905
    BERNARD 0.04867778 0.10023262 0.10377289 0.64611132 0.01713585
    YURKOV  0.03769960 0.49858212 0.16480554 0.08379015 0.17193305
    WARNERS 0.02160805 0.48164324 0.09968563 0.05891525 0.17021193
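
One consequence worth checking: summed over all components, an individual's cos2 values add up to 1, so the values shown in the summary (only the first ncp components) tell you what share of that individual is captured there. A base-R sketch, again assuming default options and using USArrests and my own variable names as stand-ins:

```r
# Base-R sketch; USArrests is only an illustrative built-in dataset.
dat <- as.matrix(USArrests)
n <- nrow(dat)
X <- scale(dat) * sqrt(n / (n - 1))

coord <- X %*% svd(X)$v
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))
dist <- sqrt(rowSums(X^2))

# dist has one entry per row, so it is recycled down each column,
# dividing every coordinate by its own individual's distance.
cos2 <- (coord / dist)^2

print(head(round(cos2, 3)))
print(range(rowSums(cos2)))  # over all components, each row should sum to 1
```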
    

    For variables, interpretations for "Dim.X" / "ctr" / "cos2" are similar. The exact calculations are more complicated, especially if you specify non-uniform weights for rows / columns. You can check PCA's code for details there.
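
For the simple case only (scale.unit = TRUE with uniform row and column weights), the variable quantities can be sketched in base R as below. This is my own reconstruction rather than FactoMineR's code, with USArrests and the variable names as placeholders. Under these assumptions a variable's Dim.X is its correlation with the component, cos2 is that value squared, and ctr is cos2 as a percentage of the component's eigenvalue.

```r
# Base-R sketch; USArrests is only an illustrative built-in dataset.
dat <- as.matrix(USArrests)
n <- nrow(dat)
X <- scale(dat) * sqrt(n / (n - 1))

sv <- svd(X)
eig <- sv$d^2 / n          # eigenvalues (variance of each component)
ind_coord <- X %*% sv$v    # individuals' coordinates

# Variable coordinates: principal axes scaled by sqrt(eigenvalue).
var_coord <- sweep(sv$v, 2, sqrt(eig), "*")
dimnames(var_coord) <- list(colnames(dat), paste0("Dim.", seq_along(eig)))

var_cos2 <- var_coord^2
var_contrib <- 100 * sweep(var_cos2, 2, eig, "/")

# The coordinates should match the variable-component correlations.
print(all.equal(unname(var_coord), unname(cor(X, ind_coord))))
print(colSums(var_contrib))  # each component's contributions should total 100
```

You can check this against the question's table: for Acceleration on Dim.2, 0.762^2 is about 0.581 (cos2), and 0.581 / 0.729 * 100 is about 79.7 (ctr).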