I am doing a quick PCA and visualization on a small dataset (20 observations, 17 variables, most of them highly correlated). I use library(psych), where the ready-made function principal() does most of the work. I got a standardized loadings matrix. A sample of the output is as follows (Vi are variables; only some are shown):
     PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9
V1  0.20 -0.79  0.46  0.06 -0.20  0.22 -0.06  0.03 -0.15
V2  0.18 -0.86  0.37 -0.12 -0.09  0.17 -0.11 -0.01 -0.05
V3  0.72  0.42 -0.16  0.23 -0.35 -0.17  0.21 -0.05  0.03
V4  0.81  0.34 -0.21  0.34 -0.22  0.03 -0.01 -0.04  0.00
V5  0.61 -0.38 -0.34 -0.02  0.37 -0.27  0.35  0.03 -0.12
V6  0.80  0.31  0.02 -0.08 -0.38  0.20 -0.04 -0.13 -0.19
I want to retain 2 or 3 principal components (other tests suggest doing so) and draw a scatter plot of my data in the PC1-PC2 plane or in the 3D PC1-PC2-PC3 space. How can I do this in R?
Here is a sample of the raw data (first several rows and variables only).
field,V1,V2,V3,V4,V5,V6
Shah-Deniz,37.5,70,16200,23000,300,250
Sanate,180,150,14000,17000,175,190
Kern-River,275,250,13000,17000,64,240
East Texas,90,100,11000,12000,520,160
Smackover,35,25,13700,15000,50,170
South Pass,45,60,14100,15000,61,190
Monroe,27,30,14400,15000,72,150
Minas,170,230,6500,7300,300,90
I am aware that the solution involves somehow multiplying the raw data matrix by the loadings matrix to obtain the projections onto the PC space, but after several trials I am still confused about this matrix multiplication and the order of its factors. The second challenge is the scatter plot itself (2D or 3D), with every point labelled by its observation number. Maybe there is already a function in the package that does this matrix algebra and visualizes the result directly?
Update. One source of confusion is that the variables in the raw data are not on comparable scales (some are in km, some in m, others in km^2 or mln. tons). So at some stage the scaled data matrix should come into play?
I'm not familiar with the psych library, but you can do this easily in base R:
X <- data.frame(matrix(rnorm(100), nrow = 10)) # Make an example data frame
pca <- princomp(X, cor = TRUE) # Perform PCA. cor = TRUE runs it on the correlation matrix, which is scale-free, so it should get around your 'variables on different scales' issue.
scores <- pca$scores # Extract the PCA scores
dev.new() # Open a new graphics device (portable alternative to windows())
plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2", type = "n") # Empty plot for the first 2 PCs
text(scores[, 1], scores[, 2], row.names(X), cex = 0.8) # Label each point; replace row.names(X) with whatever your observations are called
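On the matrix multiplication and its order: the scores are the scaled data matrix (observations in rows) multiplied on the right by the eigenvector matrix of the correlation matrix (variables in rows, components in columns). Below is a minimal sketch of that in base R, using only the eight sample rows and six variables posted in the question, so the numbers will not match the full 20 x 17 dataset. The names raw, Z, eig, scores_manual and max_diff are my own. Note that it uses the unit-length eigenvectors; as far as I understand, the standardized loadings printed by principal() are those eigenvectors rescaled by the square roots of the eigenvalues, so multiplying by the loadings directly would give differently scaled axes.

```r
# Sample rows from the question (first six variables only)
raw <- data.frame(
  V1 = c(37.5, 180, 275, 90, 35, 45, 27, 170),
  V2 = c(70, 150, 250, 100, 25, 60, 30, 230),
  V3 = c(16200, 14000, 13000, 11000, 13700, 14100, 14400, 6500),
  V4 = c(23000, 17000, 17000, 12000, 15000, 15000, 15000, 7300),
  V5 = c(300, 175, 64, 520, 50, 61, 72, 300),
  V6 = c(250, 190, 240, 160, 170, 190, 150, 90),
  row.names = c("Shah-Deniz", "Sanate", "Kern-River", "East Texas",
                "Smackover", "South Pass", "Monroe", "Minas")
)

Z   <- scale(raw)         # centre each column and divide by its standard deviation
eig <- eigen(cor(raw))    # eigen-decomposition of the correlation matrix

# Order matters: (observations x variables) %*% (variables x components)
scores_manual <- Z %*% eig$vectors

# Cross-check against prcomp(), which does the same thing internally via SVD
pca <- prcomp(raw, center = TRUE, scale. = TRUE)

# The sign of each component is arbitrary, so compare absolute values
# (first three components only)
max_diff <- max(abs(abs(scores_manual[, 1:3]) - abs(pca$x[, 1:3])))
```

If max_diff is essentially zero, the manual product and the ready-made scores agree, and scores_manual[, 1:2] can be plotted exactly like scores[, 1:2] above.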
I'm not sure how to do the 3D scatter plot off the top of my head, but with PCA I always just do multiple 2D plots, e.g. PC1 vs. PC2, PC1 vs. PC3, etc.
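The multiple 2D plots can be drawn in one go with base R's pairs(). This is just a sketch on the same kind of random example data as above; pc3 and label_panel are names I made up for the illustration.

```r
set.seed(1)                                      # reproducible example data
X   <- data.frame(matrix(rnorm(100), nrow = 10))
pca <- princomp(X, cor = TRUE)

# Keep the first three components and give them readable names
pc3 <- pca$scores[, 1:3]
colnames(pc3) <- c("PC1", "PC2", "PC3")

# Panel function that writes the observation label instead of a point symbol
label_panel <- function(x, y, ...) {
  text(x, y, labels = row.names(X), cex = 0.8)
}

# Scatter-plot matrix: PC1 vs. PC2, PC1 vs. PC3, PC2 vs. PC3
pairs(pc3, panel = label_panel)
```

Each off-diagonal panel is one of the pairwise projections, so you see all three views of the PC1-PC2-PC3 space without needing a 3D plotting package.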