
PCA scores for only the first principal component have the "wrong" sign


I am currently trying to get into principal component analysis and regression. I therefore tried calculating the principal components of a given matrix by hand and comparing them with the results from R's princomp function.

The following is the code for doing PCA by hand:


### compute principal component loadings and scores by hand

df <- matrix(nrow = 5, ncol = 3, c(90,90,60,60,30,
                                   60,90,60,60,30,
                                   90,30,60,90,60))

# calculate the covariance matrix (variances and covariances of the columns of df)
cov.mat <- cov.wt(df)
cen <- cov.mat$center
n.obs <- cov.mat$n.obs
cv <- cov.mat$cov * (1-1/n.obs)

## calculate the eigenvectors and eigenvalues
edc <- eigen(cv, symmetric = TRUE)
ev <- edc$values
evec <- edc$vectors
cn <- paste0("Comp.", seq_len(ncol(cv)))  # component names in princomp's style

### get loadings (or principal component weights) out of the eigenvectors and compute scores

loadings <- structure(edc$vectors, class = "loadings")

df.scaled <- scale(df, center = cen, scale = FALSE)
scr <- df.scaled %*% evec 

I compared my results to those obtained with princomp:

pca.mod <- princomp(df)
loadings.mod <- pca.mod$loadings
scr.mod <- pca.mod$scores

scr
scr.mod

> scr
           [,1]       [,2]       [,3]
[1,]  -6.935190  32.310906  7.7400588
[2,] -48.968014 -19.339313 -0.3529382
[3,]   1.733797  -8.077726 -1.9350147
[4,]  13.339605  18.519500 -9.5437444
[5,]  40.829802 -23.413367  4.0916385
> scr.mod
         Comp.1     Comp.2     Comp.3
[1,]   6.935190  32.310906  7.7400588
[2,]  48.968014 -19.339313 -0.3529382
[3,]  -1.733797  -8.077726 -1.9350147
[4,] -13.339605  18.519500 -9.5437444
[5,] -40.829802 -23.413367  4.0916385

So apparently I did quite well: the computed scores agree in magnitude. However, the scores for the first principal component differ in sign. This is not the case for the other two.
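
The comparison can also be made numerically instead of by eye. The following is only a sketch: it recomputes scr and scr.mod with the same steps as above, and same_magnitude and sign_relation are names introduced here for illustration.

```r
# Rebuild both sets of scores (same data and steps as above).
df <- matrix(nrow = 5, ncol = 3, c(90, 90, 60, 60, 30,
                                   60, 90, 60, 60, 30,
                                   90, 30, 60, 90, 60))
cw   <- cov.wt(df)
cv   <- cw$cov * (1 - 1 / cw$n.obs)   # rescale to divisor n instead of n - 1
evec <- eigen(cv, symmetric = TRUE)$vectors
scr  <- scale(df, center = cw$center, scale = FALSE) %*% evec
scr.mod <- princomp(df)$scores

# Do the two results agree once signs are ignored?
same_magnitude <- isTRUE(all.equal(abs(scr), abs(unname(scr.mod)),
                                   check.attributes = FALSE))

# Per-component sign relation: +1 = same sign, -1 = flipped.
# colSums(a * b) is the inner product of matching columns.
sign_relation <- sign(colSums(scr * unname(scr.mod)))

same_magnitude
sign_relation
```

Which components come out flipped can depend on the R version and the underlying linear algebra library, so no particular sign pattern is assumed here.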

This leads to two questions:

  1. I have read that multiplying the loadings and the scores of a principal component by minus one is not a problem. Does this still hold when only one of the principal components has a different sign?
  2. What am I doing "wrong" from a computational standpoint? The procedure seems straightforward to me, and I don't see what I could change in my own calculations to get the same signs as princomp.

When I checked this with the mtcars data set, the signs of my first PC were right, but the second and fourth PC scores had different signs compared to princomp. I cannot make any sense of this. Any help is appreciated!


Solution

  • The signs of eigenvectors, and hence of loadings and scores, are arbitrary, so there is nothing "wrong" here. The only thing that you should expect to be preserved is the overall pattern of signs within each loadings vector (and within the corresponding column of scores). In the example above, the princomp scores for PC1 have signs +,+,-,-,- while yours have -,-,+,+,+. That's fine. If yours had e.g. -,+,-,-,+, that would be trouble (because the two would no longer be equivalent up to multiplication by -1).

    However, while it's generally true that the signs are arbitrary and hence could vary across algorithms, compilers, operating systems, etc., there's an easy solution in this particular case. princomp has a fix_sign argument:

    fix_sign: Should the signs of the loadings and scores be chosen so that the first element of each loading is non-negative?

    Try princomp(df,fix_sign=FALSE)$scores and you'll see that the signs (probably!) line up with your results. (In general the fix_sign=TRUE option is useful because it breaks the symmetry in a specific way and thus will always result in the same answers across all platforms.)
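
To get the same signs in the hand calculation, one option is to apply the same convention yourself: flip every eigenvector whose first element is negative, and flip the matching column of scores. The following is a minimal sketch of that idea, assuming an R version whose princomp accepts fix_sign as described above; flip, evec.fixed and scr.fixed are names made up for this illustration.

```r
df <- matrix(nrow = 5, ncol = 3, c(90, 90, 60, 60, 30,
                                   60, 90, 60, 60, 30,
                                   90, 30, 60, 90, 60))
cw       <- cov.wt(df)
cv       <- cw$cov * (1 - 1 / cw$n.obs)      # divisor n instead of n - 1
evec     <- eigen(cv, symmetric = TRUE)$vectors
centered <- scale(df, center = cw$center, scale = FALSE)
scr      <- centered %*% evec

# -1 for every component whose first loading is negative, +1 otherwise
flip <- ifelse(evec[1, ] < 0, -1, 1)

# Flip loadings and scores together, column by column
evec.fixed <- sweep(evec, 2, flip, `*`)
scr.fixed  <- sweep(scr,  2, flip, `*`)

# The flip changes nothing essential: scores %*% t(loadings) still
# reconstructs the centered data, with or without the flip.
recon.raw   <- scr %*% t(evec)
recon.fixed <- scr.fixed %*% t(evec.fixed)
same_recon  <- isTRUE(all.equal(recon.raw, recon.fixed))
recovers    <- isTRUE(all.equal(recon.fixed, centered,
                                check.attributes = FALSE))

# Compare with princomp using the same convention
pc <- princomp(df, fix_sign = TRUE)
scores_match   <- isTRUE(all.equal(scr.fixed, unname(pc$scores),
                                   check.attributes = FALSE))
loadings_match <- isTRUE(all.equal(evec.fixed,
                                   unname(unclass(pc$loadings)),
                                   check.attributes = FALSE))

c(same_recon, recovers, scores_match, loadings_match)
```

Because the loadings and scores of a component are flipped together, their product is unchanged, which is why flipping any subset of components (one, two, or all) is harmless.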