Tags: r, loading, scatter-plot, pca, psych

R: PCA and plotting observations in principal component space


I am running a quick PCA and visualization on a small dataset (20 observations, 17 variables, most of them highly correlated). I use library(psych), whose ready-made function principal() does most of the work. It gives me a standardized loading matrix. A sample of the output follows (Vi are variables; only some are shown):

      PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9
V1   0.20 -0.79  0.46  0.06 -0.20  0.22 -0.06  0.03 -0.15
V2   0.18 -0.86  0.37 -0.12 -0.09  0.17 -0.11 -0.01 -0.05
V3   0.72  0.42 -0.16  0.23 -0.35 -0.17  0.21 -0.05  0.03
V4   0.81  0.34 -0.21  0.34 -0.22  0.03 -0.01 -0.04  0.00
V5   0.61 -0.38 -0.34 -0.02  0.37 -0.27  0.35  0.03 -0.12
V6   0.80  0.31  0.02 -0.08 -0.38  0.20 -0.04 -0.13 -0.19

I want to retain 2 or 3 principal components (other tests suggest doing so) and draw a scatter plot of my data in the PC1-PC2 plane or in the 3D PC1-PC2-PC3 space. How can I do this in R?

Here is a sample of the raw data (first several rows; only some variables shown):

field,V1,V2,V3,V4,V5,V6
Shah-Deniz,37.5,70,16200,23000,300,250
Sanate,180,150,14000,17000,175,190
Kern-River,275,250,13000,17000,64,240
East Texas,90,100,11000,12000,520,160
Smackover,35,25,13700,15000,50,170
South Pass,45,60,14100,15000,61,190
Monroe,27,30,14400,15000,72,150
Minas,170,230,6500,7300,300,90

I am aware that the solution involves multiplying the raw data matrix by the loadings matrix to obtain projections onto the PC space, but after several attempts I am still confused about this matrix multiplication and the order of its operands. The second challenge is the scatter plot itself (2D or 3D), with every point labelled by its observation number. Is there perhaps already a function in the package that does this matrix algebra and produces the plot directly?

Update. One source of confusion is that the variables in the raw data are not on comparable scales (some are in km, some in m, others in km^2 or million tons). So at some stage the scaled data matrix should come into play?


Solution

  • I'm not familiar with the psych package, but you can do this easily in base R:

    X <- data.frame(matrix(rnorm(100), nrow = 20)) # Make an example data frame (20 observations, 5 variables); princomp() needs at least as many observations as variables
    pca <- princomp(X, cor = TRUE) # Perform PCA. cor = TRUE uses the correlation matrix, which is scale-free, so it should get around your 'variables on different scales' issue.
    scores <- pca$scores # Extract PCA scores (observations projected onto the components)
    dev.new() # Open a new graphics device (cross-platform replacement for windows())
    plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2", type = "n") # Empty PC1 vs. PC2 frame
    text(scores[, 1], scores[, 2], row.names(X), cex = 0.8) # Label each point; replace row.names(X) with whatever your observations are called
    

    I'm not sure offhand how to do the 3D scatter plot, but with PCA I always just draw several 2D plots, e.g. PC1 vs. PC2, PC1 vs. PC3, etc.
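
    On the matrix multiplication the question asks about, here is a minimal sketch in base R. It uses prcomp() rather than psych::principal(), and only the eight sample rows and six variables posted in the question, so the numbers will not match the loadings table above. The order is (scaled data, n x p) times (eigenvector matrix, p x k). The last part assumes that "standardized loadings" means eigenvectors multiplied by the component standard deviations, which is the usual convention but is not confirmed here for psych. The names Z, loadings and std_scores are only illustrative.

```r
# Sample rows from the question; field names become the row labels.
X <- data.frame(
  V1 = c(37.5, 180, 275, 90, 35, 45, 27, 170),
  V2 = c(70, 150, 250, 100, 25, 60, 30, 230),
  V3 = c(16200, 14000, 13000, 11000, 13700, 14100, 14400, 6500),
  V4 = c(23000, 17000, 17000, 12000, 15000, 15000, 15000, 7300),
  V5 = c(300, 175, 64, 520, 50, 61, 72, 300),
  V6 = c(250, 190, 240, 160, 170, 190, 150, 90),
  row.names = c("Shah-Deniz", "Sanate", "Kern-River", "East Texas",
                "Smackover", "South Pass", "Monroe", "Minas")
)

# PCA on centred, unit-variance variables (equivalent to using the correlation matrix).
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Scale first, then project: (n x p) %*% (p x k) gives one row of scores per observation.
Z <- scale(X)
scores <- Z %*% pca$rotation          # same values as pca$x

# Assumed convention: standardized loadings = eigenvectors * component sd.
loadings <- pca$rotation %*% diag(pca$sdev)
# Multiplying by such loadings rescales the scores, so divide by the variances
# to get unit-variance component scores.
std_scores <- Z %*% loadings %*% diag(1 / pca$sdev^2)

# Labelled 2D views of the first three components in one scatter-plot matrix.
pairs(scores[, 1:3],
      panel = function(x, y, ...) text(x, y, labels = rownames(X), cex = 0.7))
```

    The pairs() call is one way to follow the "several 2D plots" advice for three retained components; swapping rownames(X) for seq_len(nrow(X)) labels the points with observation numbers instead of names.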