I am currently trying to get into principal component analysis and regression. I therefore tried caclulating the principal components of a given matrix by hand and compare it with the results you get out of the r-package rcomp.
The following is the code for doing pca by hand
### compute principal component loadings and scores by hand
df <- matrix(nrow = 5, ncol = 3, c(90,90,60,60,30,
60,90,60,60,30,
90,30,60,90,60))
# calculate covariance matrix to see variance and covariance of
cov.mat <- cov.wt(df)
cen <- cov.mat$center
n.obs <- cov.mat$n.obs
cv <- cov.mat$cov * (1-1/n.obs)
## calcualate the eigenvector and values
edc <- eigen(cv, symmetric = TRUE)
ev <- edc$values
evec <- edc$vectors
cn <- paste0("Comp.", 1L:ncol(cv))
cen <- cov.mat$center
### get loadings (or principal component weights) out of the eigenvectors and compute scores
loadings <- structure(edc$vectors, class = "loadings")
df.scaled <- scale(df, center = cen, scale = FALSE)
scr <- df.scaled %*% evec
I compared my results to the ones obtained by using the princomp-package
pca.mod <- princomp(df)
loadings.mod <- pca.mod$loadings
scr.mod <- pca.mod$scores
scr
scr.mod
> scr
[,1] [,2] [,3]
[1,] -6.935190 32.310906 7.7400588
[2,] -48.968014 -19.339313 -0.3529382
[3,] 1.733797 -8.077726 -1.9350147
[4,] 13.339605 18.519500 -9.5437444
[5,] 40.829802 -23.413367 4.0916385
> scr.mod
Comp.1 Comp.2 Comp.3
[1,] 6.935190 32.310906 7.7400588
[2,] 48.968014 -19.339313 -0.3529382
[3,] -1.733797 -8.077726 -1.9350147
[4,] -13.339605 18.519500 -9.5437444
[5,] -40.829802 -23.413367 4.0916385
So apparently, I did quite good. The computed scores equal at least scale-wise. However: The scores for the first pricipal components differ in the sign. This is not the case for the other two.
This leads to two questions:
When checking this with the mtcars data set, the signs for my first PC were right, however now the second and fourth PC scores are of different signs, compared to the package. I can not make any sense of this. Any help is appreciated!
The signs of eigenvectors and loadings are arbitrary, so there is nothing "wrong" here. The only thing that you should expect to be preserved is the overall pattern of signs within each loadings vector, i.e. in the example above the princomp
answer for PC1 gives +,+,-,-,-
while yours gives -,-,+,+,+
. That's fine. If yours gave e.g. -,+,-,-,+
that would be trouble (because the two would no longer be equivalent up to multiplication by -1).
However, while it's generally true that the signs are arbitrary and hence could vary across algorithms, compilers, operating systems, etc., there's an easy solution in this particular case. princomp
has a fix_sign
argument:
fix_sign:
Should the signs of the loadings and scores be chosen so that the first element of each loading is non-negative?
Try princomp(df,fix_sign=FALSE)$scores
and you'll see that the signs (probably!) line up with your results. (In general the fix_sign=TRUE
option is useful because it breaks the symmetry in a specific way and thus will always result in the same answers across all platforms.)