Plotting two principal component score vectors, using a different color to indicate three unique classes


After generating a simulated data set with 20 observations in each of three classes (i.e., 60 observations total), and 50 variables, I need to plot the first two principal component score vectors, using a different color to indicate the three unique classes.

I believe I can create the simulated data set (please verify), but I am having issues figuring out how to color the classes and plot. I need to make sure the three classes appear separated in the plot (or else I need to re-run the simulated data).

#for the response variable y (60 values - 3 classes 1,2,3  - 20 observations per class)
y <- rep(c(1,2,3),20)

#matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)   
x <- matrix( rnorm(3000), ncol=50)

xymatrix <- cbind(y,x)
dim(x)
# [1] 60 50
dim(xymatrix)
# [1] 60 51
pca <- prcomp(xymatrix, scale. = TRUE)

How should I correctly plot and color this principal component analysis as noted above? Thank you.


Solution

  • If I understand your question correctly, ggparcoord in the GGally package would help you. It draws a parallel coordinate plot, with one line per observation, colored by group.

    library(GGally)
    y <- rep(c(1,2,3), 20)
    
    # matrix of 50 variables i.e. 50 columns and 60 rows 
    # i.e. 60x50 dimensions (=3000 table cells)   
    x <- matrix(rnorm(3000), ncol=50)
    
    xymatrix <- cbind(y,x)
    pca <- prcomp(xymatrix, scale. = TRUE)
    
    # Principal component scores and group label 'y'
    pc_label <- data.frame(pca$x, y=as.factor(y))
    
    # Plot the first two principal component scores of each sample
    ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
    

    However, I think it makes more sense to run PCA on x alone rather than on xymatrix, which includes the class label y; otherwise the labels themselves influence the components. So the following code should be more appropriate in your case.

    pca <- prcomp(x, scale. = TRUE)
    
    pc_label <- data.frame(pca$x, y=as.factor(y))
    
    ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
    

    If you want a scatter plot of the first two principal component scores, you can make one with ggplot2.

    library(ggplot2)
    
    ggplot(data=pc_label) + 
      geom_point(aes(x=PC1, y=PC2, colour=y))
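
On the "please verify" part of the question: the simulated data has the right shape (60 x 50, with 20 observations per class), but every value is drawn from the same standard normal distribution, so the labels in y carry no information about x and the classes should not be expected to separate in any plot. Below is a minimal sketch of one way to simulate classes that do differ, assuming a mean shift is an acceptable form of separation. The shift size of 2, the choice of shifted columns, the seed, and the colors are all illustrative assumptions, not part of the original question. It uses base R graphics so it runs without extra packages.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

n_per_class <- 20
n_vars <- 50

# Class labels: 20 observations in each of classes 1, 2, 3
y <- rep(1:3, each = n_per_class)

# Start from pure noise, as in the question: 60 rows, 50 columns
x <- matrix(rnorm(3 * n_per_class * n_vars), ncol = n_vars)

# Assumption: give each class a different mean in some variables.
# Class 2 is shifted in columns 1-10, class 3 in columns 11-20.
shift <- 2
x[y == 2, 1:10]  <- x[y == 2, 1:10]  + shift
x[y == 3, 11:20] <- x[y == 3, 11:20] + shift

# PCA on the predictors only (the label y is left out)
pca <- prcomp(x, scale. = TRUE)

# Scores on the first two principal components, one row per observation
scores <- pca$x[, 1:2]

# Scatter plot of PC1 against PC2, one color per class
class_cols <- c("red", "forestgreen", "blue")
plot(scores, col = class_cols[y], pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = paste("Class", 1:3), col = class_cols, pch = 19)
```

If the three groups still overlap, increasing shift or shifting more columns should pull them further apart. The same pca and y objects can also be fed into the ggplot code above by rebuilding pc_label from them.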