Search code examples
rpcaprincomp

Calculating new variables from Principal Components - PCA in R


For purpose of learning PCA in R, I have run princomp() function (from MASS package) on iris dataset. I have followed following steps:

 library(MASS)
 irispca<-princomp(iris[-5])
 summary(irispca)
 irispca$loadings

In order to calculate principal components, I have used output of loadings in this way:

 iris_temp2 <- iris
 iris_temp2$Comp.1 <- with(iris_temp2,Sepal.Length*0.361+Petal.Length*0.857+Petal.Width*0.358)
 iris_temp2$Comp.2 <- with(iris_temp2,Sepal.Length*(-0.657)+Sepal.Width*(-0.73)+Petal.Length*0.173)
 iris_temp2$Comp.3 <- with(iris_temp2,Sepal.Length*(-0.582)+Sepal.Width*0.598+Petal.Width*0.546)
 iris_temp2$Comp.4 <- with(iris_temp2,Sepal.Length*0.315+Sepal.Width*(-0.32)+Petal.Length*(-0.48)+Petal.Width*0.754)
 iris_temp2 <- with(iris_temp2, iris_temp2[order(Comp.1,Comp.2,Comp.3,Comp.4),])

At last, I sorted the dataset. I have also come to know that scores gives out the same above thing i.e. scores are calculated by multiplying scaled data (on which you run PCA) with loadings. Hence, I thought of comparing output of scores and output of iris_temp2 (featuring four components).

 iris_temp1 <- as.data.frame(irispca$scores)
 iris_temp1 <- with(iris_temp1, iris_temp1[order(Comp.1,Comp.2,Comp.3,Comp.4),])

However, when I do head(iris_temp1) and head(iris_temp2[,6:9]), the outputs do not match.

I would request you people to point out the reason behind this observation. Is there anything that I have misunderstood? If you need any other input from my end, please do let me know.

Reference materials that I have used are: http://yatani.jp/teaching/doku.php?id=hcistats:pca and https://www.youtube.com/watch?v=I5GxNzKLIoU&spfreload=5.

Thanks Shankar


Solution

  • princomp does not reorder the data, each row is transformed to the scores, so there is no need to reorder the data when comparing. The scores involve both a demeaning of the data and a change of basis by the matrix of eigenvalues.

    What this means is that firstly you need to demean your data, i.e.

    library(MASS)
    irispca<-princomp(iris[-5])
    
    iris2 <- as.matrix(iris[-5])
    iris2 <- sweep(iris2, MARGIN=2, irispca$center, FUN="-")
    

    Then it is important to realize that the print method for princomp objects rounds values for display purpose

    irispca$loadings
    
    Loadings:
                 Comp.1 Comp.2 Comp.3 Comp.4
    Sepal.Length  0.361 -0.657  0.582  0.315
    Sepal.Width         -0.730 -0.598 -0.320
    Petal.Length  0.857  0.173        -0.480
    Petal.Width   0.358        -0.546  0.754
    

    But when we actually inspect one of the components we see its full values

    irispca$loadings[,1]
    
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
      0.36138659  -0.08452251   0.85667061   0.35828920
    

    Taking this into account we have

    is1 <- list()
    is1$Comp.1 <- iris2 %*% irispca$loadings[,1]
    is1$Comp.2 <- iris2 %*% irispca$loadings[,2]
    is1$Comp.3 <- iris2 %*% irispca$loadings[,3]
    is1$Comp.4 <- iris2 %*% irispca$loadings[,4]
    score1 <- as.data.frame(is1)
    

    which gives

    head(score1, 2)
    
    Comp.1     Comp.2     Comp.3      Comp.4
    -2.684126 -0.3193972 0.02791483 0.002262437
     2.714142  0.1770012 0.21046427 0.099026550
    
    
     head(irispca$scores, 2)
             Comp.1     Comp.2     Comp.3      Comp.4
     [1,] -2.684126 -0.3193972 0.02791483 0.002262437
     [2,] -2.714142  0.1770012 0.21046427 0.099026550
    

    A final thing to note, which wasn't asked but can often cause confusion, is that if v is a principle component than -1 * v is also a principle component. Many algorithms for determining them do not explicitly impose an orientation. From the docs

    The signs of the columns of the loadings and scores are arbitrary, and so may differ between different programs for PCA, and even between different builds of R.