Calculating new variables from Principal Components - PCA in R

For purpose of learning PCA in R, I have run princomp() function (from MASS package) on iris dataset. I have followed following steps:

 library(MASS)
 irispca<-princomp(iris[-5])
 summary(irispca)
 irispca$loadings

In order to calculate principal components, I have used output of loadings in this way:

 iris_temp2 <- iris
 iris_temp2$Comp.1 <- with(iris_temp2,Sepal.Length*0.361+Petal.Length*0.857+Petal.Width*0.358)
 iris_temp2$Comp.2 <- with(iris_temp2,Sepal.Length*(-0.657)+Sepal.Width*(-0.73)+Petal.Length*0.173)
 iris_temp2$Comp.3 <- with(iris_temp2,Sepal.Length*(-0.582)+Sepal.Width*0.598+Petal.Width*0.546)
 iris_temp2$Comp.4 <- with(iris_temp2,Sepal.Length*0.315+Sepal.Width*(-0.32)+Petal.Length*(-0.48)+Petal.Width*0.754)
 iris_temp2 <- with(iris_temp2, iris_temp2[order(Comp.1,Comp.2,Comp.3,Comp.4),])

At last, I sorted the dataset. I have also come to know that scores gives out the same above thing i.e. scores are calculated by multiplying scaled data (on which you run PCA) with loadings. Hence, I thought of comparing output of scores and output of iris_temp2 (featuring four components).

 iris_temp1 <- as.data.frame(irispca$scores)
 iris_temp1 <- with(iris_temp1, iris_temp1[order(Comp.1,Comp.2,Comp.3,Comp.4),])

However, when I do head(iris_temp1) and head(iris_temp2[,6:9]), the outputs do not match.

I would request you people to point out the reason behind this observation. Is there anything that I have misunderstood? If you need any other input from my end, please do let me know.

Reference materials that I have used are: http://yatani.jp/teaching/doku.php?id=hcistats:pca and https://www.youtube.com/watch?v=I5GxNzKLIoU&spfreload=5.

Thanks Shankar

Solution

princomp does not reorder the data, each row is transformed to the scores, so there is no need to reorder the data when comparing. The scores involve both a demeaning of the data and a change of basis by the matrix of eigenvalues.

What this means is that firstly you need to demean your data, i.e.

library(MASS)
irispca<-princomp(iris[-5])

iris2 <- as.matrix(iris[-5])
iris2 <- sweep(iris2, MARGIN=2, irispca$center, FUN="-")

Then it is important to realize that the print method for princomp objects rounds values for display purpose

irispca$loadings

Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.361 -0.657  0.582  0.315
Sepal.Width         -0.730 -0.598 -0.320
Petal.Length  0.857  0.173        -0.480
Petal.Width   0.358        -0.546  0.754

But when we actually inspect one of the components we see its full values

irispca$loadings[,1]

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
  0.36138659  -0.08452251   0.85667061   0.35828920

Taking this into account we have

is1 <- list()
is1$Comp.1 <- iris2 %*% irispca$loadings[,1]
is1$Comp.2 <- iris2 %*% irispca$loadings[,2]
is1$Comp.3 <- iris2 %*% irispca$loadings[,3]
is1$Comp.4 <- iris2 %*% irispca$loadings[,4]
score1 <- as.data.frame(is1)

which gives

head(score1, 2)

Comp.1     Comp.2     Comp.3      Comp.4
-2.684126 -0.3193972 0.02791483 0.002262437
 2.714142  0.1770012 0.21046427 0.099026550


 head(irispca$scores, 2)
         Comp.1     Comp.2     Comp.3      Comp.4
 [1,] -2.684126 -0.3193972 0.02791483 0.002262437
 [2,] -2.714142  0.1770012 0.21046427 0.099026550

A final thing to note, which wasn't asked but can often cause confusion, is that if v is a principle component than -1 * v is also a principle component. Many algorithms for determining them do not explicitly impose an orientation. From the docs

The signs of the columns of the loadings and scores are arbitrary, and so may differ between different programs for PCA, and even between different builds of R.