Tags: r, scikit-learn, pca, eigenvalue, eigenvector

PCA: eigenvalues vs eigenvectors vs loadings in Python vs R?


I am trying to calculate the PCA loadings of a dataset. The more I read about it, the more confused I get, because "loadings" is used differently in many places.

I am using sklearn.decomposition in Python for the PCA analysis, as well as R (the FactoMineR and factoextra libraries), since R provides easy visualization techniques. The following is my understanding:

  1. pca.components_ gives us the eigenvectors. They give us the directions of maximum variation.
  2. pca.explained_variance_ gives us the eigenvalues associated with those eigenvectors.
  3. eigenvectors * sqrt(eigenvalues) = loadings, which tell us how the principal components (PCs) load the variables (a minimal sketch right after this list illustrates all three quantities).
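
To make this concrete, here is a minimal sketch of the three quantities (the iris data is just a stand-in for any numeric matrix):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X = StandardScaler().fit_transform(load_iris().data)   # standardize the variables
    pca = PCA().fit(X)

    eigenvectors = pca.components_            # rows = PCs, columns = variables
    eigenvalues = pca.explained_variance_     # one eigenvalue per PC
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # variables x PCs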

Now, what I am confused by is:

  1. Many forums say that the eigenvectors are the loadings, and that multiplying the eigenvectors by sqrt(eigenvalues) just gives the strength of association. Others say eigenvectors * sqrt(eigenvalues) = loadings.

  2. Do the squared eigenvectors tell us the contribution of a variable to a PC? I believe this is equivalent to var$contrib in R.

  3. Squared loadings (of the eigenvector or of eigenvector*sqrt(eigenvalue), I don't know which) show how well a PC captures a variable (closer to 1 = variable better explained by that PC). Is this the equivalent of var$cos2 in R? If not, what is cos2 in R?

  4. Basically, I want to understand both how well a principal component captures a variable and what a variable's contribution to a PC is. I think these are two different things.

  5. What is pca.singular_values_? It is not clear from the documentation.

The first and second links I referred to contain R code with explanations, and the Stats Stack Exchange thread is the one that confused me.


Solution

  • Okay, after much research and going through many papers, here is what I have:

    1. pca.components_ = eigenvectors. Take the transpose so that PCs are columns and variables are rows.

    1.a: eigenvector**2 = each variable's contribution to a principal component. If a value is close to 1, then that PC is explained almost entirely by that variable.

    In Python -> pow(pca.components_.T, 2) [multiply by 100 if you want percentages instead of proportions] [R equivalent -> var$contrib]
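
    A quick sanity check of 1.a (a sketch; the iris data is only an example): because each eigenvector has unit length, every column of the squared, transposed components matrix already sums to 1, so the entries can be read as proportions.

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA

        X = StandardScaler().fit_transform(load_iris().data)
        pca = PCA().fit(X)

        contrib = pca.components_.T ** 2                 # variables x PCs, proportions
        print(np.allclose(contrib.sum(axis=0), 1.0))     # True: each column sums to 1
        contrib_pct = contrib * 100                      # percentages, like var$contrib in R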

    2. pca.explained_variance_ = eigenvalues

    3. pca.singular_values_ = the singular values obtained from the SVD. (singular values)**2 / (n - 1) = eigenvalues
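
    A sketch checking the relationship in point 3 (the iris data is only an example; sklearn centers the data internally, so no extra preprocessing is needed here):

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.decomposition import PCA

        X = load_iris().data
        pca = PCA().fit(X)

        n = X.shape[0]
        # explained_variance_ equals singular_values_**2 / (n - 1)
        print(np.allclose(pca.singular_values_ ** 2 / (n - 1),
                          pca.explained_variance_))   # True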

    4. eigenvectors * sqrt(eigenvalues) = loadings matrix

    4.a: the column-wise (vertical) sums of the squared loading matrix = eigenvalues (given you have taken the transpose as explained in step 1).

    4.b: the row-wise (horizontal) sums of the squared loading matrix = the variance of a variable explained by all principal components, i.e. how much of a variable's variance all the PCs together retain after the transformation (given you have taken the transpose as explained in step 1).

    In Python -> loading matrix = pca.components_.T * np.sqrt(pca.explained_variance_).
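
    A sketch checking 4.a and 4.b on standardized data (iris again, purely as an example):

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA

        X = StandardScaler().fit_transform(load_iris().data)
        pca = PCA().fit(X)

        loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # variables x PCs

        # 4.a: column sums of the squared loadings recover the eigenvalues
        print(np.allclose((loadings ** 2).sum(axis=0), pca.explained_variance_))  # True

        # 4.b: row sums give the share of each (standardized) variable's variance
        # retained by the kept PCs; with all PCs kept this is ~1 for every variable
        print((loadings ** 2).sum(axis=1))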

    For questions pertaining to r:

    var$cos2 = the squared coordinates of the variables on the factor map (for standardized PCA these coordinates are the variable-PC correlations, so cos2 is var$cor squared). It tells you how well a variable is represented by a particular principal component.

    var$contrib = summarized by point 1. In R: (var.cos2 * 100) / (total cos2 of the component) (see the 'PCA analysis in R' link).
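
    If you want the R quantities without leaving Python, here is a sketch under the assumption of standardized variables (what FactoMineR's scale.unit = TRUE does); the formulas below are my own translation, not an official sklearn API:

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA

        X = StandardScaler().fit_transform(load_iris().data)
        pca = PCA().fit(X)

        loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # variables x PCs

        # For standardized variables the loading is (approximately) the variable-PC
        # correlation, so cos2 (quality of representation) is the squared loading ...
        cos2 = loadings ** 2
        # ... and contrib is each variable's share of a component's total cos2, in percent
        contrib = cos2 * 100 / cos2.sum(axis=0)
        print(np.allclose(contrib.sum(axis=0), 100.0))   # each column sums to 100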

    Hope it helps others who are confused by PCA analysis.

    Huge thanks to -- https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another