For purpose of learning PCA in R, I have run princomp() function (from MASS package) on iris dataset. I have followed following steps:
library(MASS)
irispca<-princomp(iris[-5])
summary(irispca)
irispca$loadings
In order to calculate principal components, I have used output of loadings in this way:
iris_temp2 <- iris
iris_temp2$Comp.1 <- with(iris_temp2,Sepal.Length*0.361+Petal.Length*0.857+Petal.Width*0.358)
iris_temp2$Comp.2 <- with(iris_temp2,Sepal.Length*(-0.657)+Sepal.Width*(-0.73)+Petal.Length*0.173)
iris_temp2$Comp.3 <- with(iris_temp2,Sepal.Length*(-0.582)+Sepal.Width*0.598+Petal.Width*0.546)
iris_temp2$Comp.4 <- with(iris_temp2,Sepal.Length*0.315+Sepal.Width*(-0.32)+Petal.Length*(-0.48)+Petal.Width*0.754)
iris_temp2 <- with(iris_temp2, iris_temp2[order(Comp.1,Comp.2,Comp.3,Comp.4),])
At last, I sorted the dataset. I have also come to know that scores gives out the same above thing i.e. scores are calculated by multiplying scaled data (on which you run PCA) with loadings. Hence, I thought of comparing output of scores and output of iris_temp2 (featuring four components).
iris_temp1 <- as.data.frame(irispca$scores)
iris_temp1 <- with(iris_temp1, iris_temp1[order(Comp.1,Comp.2,Comp.3,Comp.4),])
However, when I do head(iris_temp1) and head(iris_temp2[,6:9]), the outputs do not match.
I would request you people to point out the reason behind this observation. Is there anything that I have misunderstood? If you need any other input from my end, please do let me know.
Reference materials that I have used are: http://yatani.jp/teaching/doku.php?id=hcistats:pca and https://www.youtube.com/watch?v=I5GxNzKLIoU&spfreload=5.
Thanks Shankar
princomp
does not reorder the data, each row is transformed to the scores, so there is no need to reorder the data when comparing. The scores involve both a demeaning of the data and a change of basis by the matrix of eigenvalues.
What this means is that firstly you need to demean your data, i.e.
library(MASS)
irispca<-princomp(iris[-5])
iris2 <- as.matrix(iris[-5])
iris2 <- sweep(iris2, MARGIN=2, irispca$center, FUN="-")
Then it is important to realize that the print method for princomp
objects rounds values for display purpose
irispca$loadings
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length 0.361 -0.657 0.582 0.315
Sepal.Width -0.730 -0.598 -0.320
Petal.Length 0.857 0.173 -0.480
Petal.Width 0.358 -0.546 0.754
But when we actually inspect one of the components we see its full values
irispca$loadings[,1]
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.36138659 -0.08452251 0.85667061 0.35828920
Taking this into account we have
is1 <- list()
is1$Comp.1 <- iris2 %*% irispca$loadings[,1]
is1$Comp.2 <- iris2 %*% irispca$loadings[,2]
is1$Comp.3 <- iris2 %*% irispca$loadings[,3]
is1$Comp.4 <- iris2 %*% irispca$loadings[,4]
score1 <- as.data.frame(is1)
which gives
head(score1, 2)
Comp.1 Comp.2 Comp.3 Comp.4
-2.684126 -0.3193972 0.02791483 0.002262437
2.714142 0.1770012 0.21046427 0.099026550
head(irispca$scores, 2)
Comp.1 Comp.2 Comp.3 Comp.4
[1,] -2.684126 -0.3193972 0.02791483 0.002262437
[2,] -2.714142 0.1770012 0.21046427 0.099026550
A final thing to note, which wasn't asked but can often cause confusion, is that if v
is a principle component than -1 * v
is also a principle component. Many algorithms for determining them do not explicitly impose an orientation. From the docs
The signs of the columns of the loadings and scores are arbitrary, and so may differ between different programs for PCA, and even between different builds of R.