Consider the data frame named data, created here:
set.seed(123)
data <- data.frame(State = rep(c("NY", "MA", "FL", "GA"), each = 100),
                   Loc = rep(letters[1:20], each = 20),
                   ID = sample(600, 400, replace = FALSE),
                   var1 = rnorm(400),
                   var2 = rnorm(400),
                   var3 = rnorm(400),
                   var4 = rnorm(400),
                   var5 = rnorm(400))
The columns var1 to var5 are measurements taken on individuals that were randomly sampled from a variety of locations, denoted by the Loc column, which is nested within the larger grouping State. Each individual has a unique ID number. Notice that the ID numbers are in no particular order, so the measurements are relatively meaningless without their associated grouping variables. I am using the FactoMineR and factoextra packages to do PCA and cluster analysis.
Let's say I do a PCA and decide that I want to keep the first 3 principal components (I will store the coordinates in an object called ind.cords):
library(FactoMineR)
library(factoextra)
pca <- PCA(data[, 4:8], scale.unit = TRUE, graph = FALSE)
a <- get_pca_ind(pca)
ind.cords <- a$coord[,1:3]
Next I go through the preliminary steps of determining the optimal number of clusters, and I decide on 5. I run the final kmeans to get the clusters:
set.seed(123)
clustering <- kmeans(ind.cords, centers = 5, iter.max = 50, nstart = 25)
clustering
Here is where I am having trouble: fviz_cluster() makes it easy to plot clusters:
fviz_cluster(clustering, geom = "point", data = ind.cords) + ggtitle("k = 5")
But I want to visualize which observations belong to which clusters using both grouping variables, so I need those columns to use as labels. I can go back to where I created ind.cords and add the State, Loc, and ID columns back to it: ind.cords <- cbind(data[,1:3], ind.cords).
From here I can either carry forward by specifying which columns I want to perform operations on (e.g., kmeans(ind.cords[,4:6])), or I can make a new object called input that has just the numeric columns (e.g., input <- ind.cords[,4:6]). In either case, I can't figure out how to get the fviz_ functions to label observations by State or Loc. Could someone demonstrate a practical way to do this, or explain how to restructure the way I am approaching this analysis so that I can visualize which observations and groups are in which clusters?
Ultimately (unless someone has a better suggestion for visualizing clusters with many groups) I believe it would be easier to visualize the clusters if colored text is used instead of points for the grouping variables (State or Loc), and ellipses are drawn around the points to show what clusters they belong to, so this is what I am shooting for in the graphs.
One way would be to just layer on top of it, as I cannot figure out how to map anything else in fviz_cluster(). You can adjust the alpha so that the overlapping points remain distinguishable. Example with geom_point():
set.seed(123)
data <- data.frame(State = rep(c("NY", "MA", "FL", "GA"), each = 100),
                   Loc = rep(letters[1:20], each = 20),
                   ID = sample(600, 400, replace = FALSE),
                   var1 = rnorm(400),
                   var2 = rnorm(400),
                   var3 = rnorm(400),
                   var4 = rnorm(400),
                   var5 = rnorm(400))
library(FactoMineR)
library(factoextra)
pca <- PCA(data[, 4:8], scale.unit = TRUE, graph = FALSE)
a <- get_pca_ind(pca)
ind.cords <- a$coord[, 1:3]
# Put State, Loc and ID back next to the PCA coordinates
ind.cords <- cbind(data[, 1:3], ind.cords)
# Cluster on the coordinate columns only
clustering <- kmeans(ind.cords[, 4:6], centers = 5, iter.max = 50, nstart = 25)
# Layer points shaped by State on top of the fviz_cluster() plot
fviz_cluster(clustering, geom = "point", data = ind.cords[, 4:6], shape = 16) +
  ggtitle("k = 5") +
  geom_point(aes(shape = ind.cords$State), alpha = 0.5)
You can also use geom_text():
fviz_cluster(clustering, geom = "point", data = ind.cords[, 4:6], shape = 16) +
  ggtitle("k = 5") +
  geom_text(aes(label = paste0(ind.cords$State, ":", ind.cords$Loc)),
            alpha = 0.5, size = 3, nudge_y = 0.1, show.legend = FALSE)
Created on 2020-06-08 by the reprex package (v0.3.0)
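With 400 overlapping labels the plot can get hard to read, so a plain cross-tabulation of cluster membership against each grouping variable may be a useful complement. The following is only a minimal sketch and not part of the reprex above. To stay runnable with base R alone it assumes prcomp() as a stand-in for FactoMineR's PCA(), so the coordinates (and possibly the cluster assignments) can differ slightly from the plots here. The names scores, membership, state_by_cluster and loc_by_cluster are made up for illustration.

```r
set.seed(123)
data <- data.frame(State = rep(c("NY", "MA", "FL", "GA"), each = 100),
                   Loc   = rep(letters[1:20], each = 20),
                   ID    = sample(600, 400, replace = FALSE),
                   var1 = rnorm(400), var2 = rnorm(400), var3 = rnorm(400),
                   var4 = rnorm(400), var5 = rnorm(400))

# Assumption: prcomp() stands in for FactoMineR::PCA() so only base R is needed
scores <- prcomp(data[, 4:8], scale. = TRUE)$x[, 1:3]

set.seed(123)
clustering <- kmeans(scores, centers = 5, iter.max = 50, nstart = 25)

# Keep the grouping columns next to the cluster assignment
# (kmeans() returns assignments in the same row order as its input)
membership <- data.frame(data[, 1:3], cluster = factor(clustering$cluster))

# Number of observations from each State and each Loc in every cluster
state_by_cluster <- table(membership$State, membership$cluster)
loc_by_cluster   <- table(membership$Loc, membership$cluster)
print(state_by_cluster)
print(loc_by_cluster)
```

Each row of state_by_cluster sums to 100 and each row of loc_by_cluster sums to 20, so the tables show directly how every group is split across the five clusters.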
EDIT: Setting geom = NULL in fviz_cluster() also works; it suppresses the points that fviz_cluster() would otherwise draw, leaving only the text labels:
fviz_cluster(clustering, geom = NULL, data = ind.cords[, 4:6], shape = 16) +
  ggtitle("k = 5") +
  geom_text(aes(label = paste0(ind.cords$State, ":", ind.cords$Loc)),
            size = 3, show.legend = FALSE)
EDIT: And the same with colors for clusters:
fviz_cluster(clustering, geom = NULL, data = ind.cords[, 4:6]) +
  ggtitle("k = 5") +
  geom_text(aes(label = paste0(ind.cords$State, ":", ind.cords$Loc),
                color = as.factor(clustering$cluster)),
            size = 3, show.legend = FALSE)
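If the goal is text colored by a grouping variable with ellipses marking the clusters, another option might be to skip fviz_cluster() and build the plot in ggplot2 directly. This is a hedged sketch, not a factoextra feature. It assumes prcomp() in place of PCA() to keep the example self-contained, so the axes are named PC1 and PC2 and the result may differ slightly from the plots above. The names pcs, plot_df and p are made up for illustration. The ellipses come from stat_ellipse() with its default settings, drawn over the first two components only, although the clustering uses three.

```r
library(ggplot2)

set.seed(123)
data <- data.frame(State = rep(c("NY", "MA", "FL", "GA"), each = 100),
                   Loc   = rep(letters[1:20], each = 20),
                   ID    = sample(600, 400, replace = FALSE),
                   var1 = rnorm(400), var2 = rnorm(400), var3 = rnorm(400),
                   var4 = rnorm(400), var5 = rnorm(400))

# Assumption: prcomp() stands in for FactoMineR::PCA() to avoid extra packages
pcs <- prcomp(data[, 4:8], scale. = TRUE)$x[, 1:3]

set.seed(123)
clustering <- kmeans(pcs, centers = 5, iter.max = 50, nstart = 25)

# One data frame holding groups, coordinates and cluster assignment
plot_df <- data.frame(data[, 1:3], pcs, cluster = factor(clustering$cluster))

p <- ggplot(plot_df, aes(x = PC1, y = PC2)) +
  # One ellipse per cluster, distinguished by line type
  stat_ellipse(aes(group = cluster, linetype = cluster)) +
  # Loc as the label, State as the text color
  geom_text(aes(label = Loc, colour = State), size = 3) +
  ggtitle("k = 5")
print(p)
```

Swapping the label and colour mappings (for example label = State, colour = Loc) gives the other view; with 20 locations, using Loc as text and State as color is probably the more legible of the two.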