After generating a simulated data set with 20 observations in each of three classes (i.e., 60 observations total), and 50 variables, I need to plot the first two principal component score vectors, using a different color to indicate the three unique classes.
I believe I can create the simulated data set (please verify), but I am having issues figuring out how to color the classes and plot. I need to make sure the three classes appear separated in the plot (or else I need to re-run the simulated data).
#for the response variable y (60 values - 3 classes 1,2,3 - 20 observations per class)
y <- rep(c(1,2,3),20)
#matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
dim(x)
[1] 60 50
dim(xymatrix)
[1] 60 51
pca=prcomp(xymatrix, scale=TRUE)
How should I correctly plot and color this principal component analysis as noted above? Thank you.
If I understand your question correctly, ggparcoord in the GGally package would help you.
library(GGally)
y <- rep(c(1,2,3), 20)
# matrix of 50 variables i.e. 50 columns and 60 rows
# i.e. 60x50 dimensions (=3000 table cells)
x <- matrix(rnorm(3000), ncol=50)
xymatrix <- cbind(y,x)
pca <- prcomp(xymatrix, scale. = TRUE)
# Principal components score and group label 'y'
pc_label <- data.frame(pca$x, y=as.factor(y))
# Plot the first two principal component scores of each sample
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
However, I think it makes more sense to do PCA on x rather than on xymatrix, which includes the target y. So the following code should be more appropriate in your case.
pca <- prcomp(x, scale. = TRUE)
pc_label <- data.frame(pca$x, y=as.factor(y))
ggparcoord(data=pc_label, columns=1:2, groupColumn=ncol(pc_label))
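One caveat on the simulation itself: x is pure rnorm noise with the same mean in every class, so nothing in x distinguishes the classes and the scores are unlikely to separate by class however often the simulation is re-run. Here is a minimal sketch of one possible fix, assuming a per-class mean shift on a few variables. The seed, the shift of 3, and the choice of columns 1:10 and 11:20 are illustrative assumptions, not part of the original question.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# 20 observations per class, 60 in total
y <- rep(c(1, 2, 3), each = 20)

# 60 x 50 matrix of standard normal noise
x <- matrix(rnorm(60 * 50), ncol = 50)

# Give classes 2 and 3 a mean shift on different blocks of variables
# (illustrative values; class 1 is left unshifted)
x[y == 2, 1:10]  <- x[y == 2, 1:10]  + 3
x[y == 3, 11:20] <- x[y == 3, 11:20] + 3

# PCA on the predictors only, then attach the class label
pca <- prcomp(x, scale. = TRUE)
pc_label <- data.frame(pca$x, y = as.factor(y))
```

With pc_label built this way, the plotting calls in this answer can be used unchanged. Larger shifts, or shifts on more variables, should give cleaner separation.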
If you want a scatter plot of the first two principal component scores, you can make one with ggplot.
library(ggplot2)
ggplot(data=pc_label) +
  geom_point(aes(x=PC1, y=PC2, colour=y))
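If you would rather not depend on ggplot2, roughly the same scatter plot can be drawn with base graphics. The sketch below is self-contained and makes a few assumptions of its own: the seed, the colour names in class_cols, and the legend position are arbitrary illustrative choices.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Simulated data: 20 observations per class, 50 variables
y <- rep(c(1, 2, 3), each = 20)
x <- matrix(rnorm(60 * 50), ncol = 50)

# PCA on the predictors only
pca <- prcomp(x, scale. = TRUE)

# One colour per class; indexing by y maps each observation to its colour
class_cols <- c("red", "forestgreen", "blue")

# Scatter plot of the first two principal component score vectors
plot(pca$x[, 1], pca$x[, 2],
     col = class_cols[y], pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = paste("Class", 1:3),
       col = class_cols, pch = 19)
```

Indexing class_cols by the integer label y is what assigns the colours, so this relies on y holding the values 1, 2, 3 rather than a factor with other levels.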