r, ggplot2, data-visualization, cluster-analysis, k-means

Practical way to keep grouping variables associated with observations when doing PCA and cluster analysis


Consider the data frame data created here:

set.seed(123)
data <- data.frame(State = rep(c("NY","MA","FL","GA"), each = 100),
                   Loc = rep(letters[1:20], each = 20),
                   ID = sample(600,400,replace = FALSE),
                   var1 = rnorm(400),
                   var2 = rnorm(400),
                   var3 = rnorm(400),
                   var4 = rnorm(400),
                   var5 = rnorm(400))

var1:var5 are measurements taken on individuals that were randomly sampled from a variety of locations, denoted by the Loc column, which is nested within the larger grouping State. Each individual has a unique ID number. Notice that the ID numbers are in no particular order, so the measurements are relatively meaningless without their associated grouping variables. I am using the FactoMineR and factoextra packages to do PCA and cluster analysis. Let's say I do a PCA and decide that I want to keep the first 3 principal components (I will store the coordinates in an object called ind.cords):

library(FactoMineR)
library(factoextra)
pca <- PCA(data[,4:8], scale.unit = TRUE, graph = FALSE)
a <- get_pca_ind(pca)
ind.cords <- a$coord[,1:3]

Next I go through the preliminary steps of determining the optimal number of clusters, and I decide on 5. I run the final kmeans to get the clusters:

set.seed(123)
clustering <- kmeans(ind.cords, centers = 5, iter.max = 50, nstart = 25)
clustering

Here is where I am having trouble. fviz_cluster() makes it easy to plot clusters:

fviz_cluster(clustering, geom = "point", data = ind.cords) + ggtitle("k = 5")

But I want to visualize which observations belong to which clusters using both grouping variables, so I need those columns to use as labels. I can go back to where I created ind.cords and add the State, Loc, and ID columns back to it: ind.cords <- cbind(data[,1:3], ind.cords). From here I can either carry forward by specifying which columns I want to perform operations on (e.g., kmeans(ind.cords[,4:6])) or make a new object called input that just has the numeric columns (e.g., input <- ind.cords[,4:6]). In either case, I can't figure out how to get the fviz_ functions to label observations by State or Loc.

Could someone demonstrate a practical way to do this, or explain how to restructure the way I am approaching this analysis so that I can visualize which observations and groups are in which clusters? Ultimately (unless someone has a better suggestion for visualizing clusters with many groups), I believe the clusters would be easier to read if colored text is used instead of points for the grouping variables (State or Loc) and ellipses are drawn around the points to show which cluster they belong to, so this is what I am aiming for in the graphs.


Solution

  • One way is to layer extra geoms on top of the fviz_cluster() plot, since I could not figure out how to map any other variables inside fviz_cluster() itself. You can adjust alpha so that the overlaid symbols or labels remain legible. Example with geom_point():

    set.seed(123)
    data <- data.frame(State = rep(c("NY","MA","FL","GA"), each = 100),
                       Loc = rep(letters[1:20], each = 20),
                       ID = sample(600,400,replace = FALSE),
                       var1 = rnorm(400),
                       var2 = rnorm(400),
                       var3 = rnorm(400),
                       var4 = rnorm(400),
                       var5 = rnorm(400))
    
    library(FactoMineR)
    library(factoextra)
    
    pca <- PCA(data[,4:8], scale.unit = TRUE, graph = FALSE)
    a <- get_pca_ind(pca)
    ind.cords <- a$coord[,1:3]
    ind.cords <- cbind(data[,1:3], ind.cords)
    
    clustering <- kmeans(ind.cords[,4:6], centers = 5, iter.max = 50, nstart = 25)
    
    fviz_cluster(clustering, geom = "point", data = ind.cords[,4:6], shape = 16) + ggtitle("k = 5") +
      geom_point(aes(shape = ind.cords$State), alpha = 0.5)
    

    You can also use geom_text():

    
    fviz_cluster(clustering, geom = "point", data = ind.cords[,4:6], shape = 16) + ggtitle("k = 5") +
      geom_text(aes(label = paste0(ind.cords$State, ":", ind.cords$Loc)), alpha = 0.5, size = 3, nudge_y = 0.1, show.legend = FALSE)
    

    Created on 2020-06-08 by the reprex package (v0.3.0)
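
The layering above relies on the rows of ind.cords staying in their original order, so that the grouping columns line up with the cluster labels. Here is a minimal base-R sketch of that idea. It assumes prcomp() as a stand-in for FactoMineR::PCA() so that no extra packages are needed (the scores may differ from the ones above in scaling and sign), and the names scores, km, clustered, state_by_cluster and loc_by_cluster are made up for illustration. It binds the k-means labels to State, Loc and ID and cross-tabulates them, which gives a non-graphical check of which groups fall in which clusters.

```r
# Rebuild the example data from the question.
set.seed(123)
data <- data.frame(State = rep(c("NY", "MA", "FL", "GA"), each = 100),
                   Loc   = rep(letters[1:20], each = 20),
                   ID    = sample(600, 400, replace = FALSE),
                   var1  = rnorm(400),
                   var2  = rnorm(400),
                   var3  = rnorm(400),
                   var4  = rnorm(400),
                   var5  = rnorm(400))

# Assumption: prcomp() replaces FactoMineR::PCA() here to stay in base R.
pc     <- prcomp(data[, 4:8], scale. = TRUE)
scores <- pc$x[, 1:3]            # keep the first three components

set.seed(123)
km <- kmeans(scores, centers = 5, iter.max = 50, nstart = 25)

# kmeans() returns one label per input row, in row order, so the labels
# can be bound straight onto the grouping columns.
clustered <- cbind(data[, 1:3], cluster = factor(km$cluster))

# Cross-tabulate each grouping variable against the cluster labels.
state_by_cluster <- table(clustered$State, clustered$cluster)
loc_by_cluster   <- table(clustered$Loc, clustered$cluster)

print(state_by_cluster)
print(loc_by_cluster)
```

Because nothing is sorted or filtered between the PCA and the clustering, clustered can also be passed to any plotting function that needs State, Loc and cluster in the same data frame.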

    EDIT: Setting geom = NULL also works, so you can suppress the geom_point() done by fviz_cluster():

    
    fviz_cluster(clustering, geom = NULL, data = ind.cords[,4:6], shape = 16) + ggtitle("k = 5") +
      geom_text(aes(label = paste0(ind.cords$State, ":", ind.cords$Loc)), size = 3, show.legend = FALSE)
    

    EDIT: And the same with colors for clusters:

    
    fviz_cluster(clustering, geom = NULL, data = ind.cords[,4:6]) + 
      ggtitle("k = 5") +
      geom_text(aes(label = paste0(ind.cords$State, ":", ind.cords$Loc),
                    color = as.factor(clustering$cluster)),
                size = 3, show.legend = FALSE)
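
If the goal is text colored by a grouping variable with an ellipse around each cluster, another option is to skip fviz_cluster() and build the plot in plain ggplot2. The following is only a sketch under stated assumptions: prcomp() again stands in for FactoMineR::PCA(), the first two principal components are plotted directly, the ellipses come from stat_ellipse() with its default settings rather than from fviz_cluster(), and plot_df and p are illustrative names.

```r
library(ggplot2)

# Rebuild the example data from the question.
set.seed(123)
data <- data.frame(State = rep(c("NY", "MA", "FL", "GA"), each = 100),
                   Loc   = rep(letters[1:20], each = 20),
                   ID    = sample(600, 400, replace = FALSE),
                   var1  = rnorm(400),
                   var2  = rnorm(400),
                   var3  = rnorm(400),
                   var4  = rnorm(400),
                   var5  = rnorm(400))

# Assumption: prcomp() replaces FactoMineR::PCA() to limit dependencies.
pc <- prcomp(data[, 4:8], scale. = TRUE)

set.seed(123)
km <- kmeans(pc$x[, 1:3], centers = 5, iter.max = 50, nstart = 25)

# One data frame holding the groups, the plotting coordinates and the cluster.
plot_df <- data.frame(data[, c("State", "Loc")],
                      PC1     = pc$x[, 1],
                      PC2     = pc$x[, 2],
                      cluster = factor(km$cluster))

p <- ggplot(plot_df, aes(x = PC1, y = PC2)) +
  # One ellipse per cluster, told apart by line type so that the
  # color scale stays reserved for State.
  stat_ellipse(aes(group = cluster, linetype = cluster)) +
  # Loc as the text label, colored by State.
  geom_text(aes(label = Loc, colour = State), size = 3) +
  ggtitle("k = 5")

print(p)
```

Keeping everything in one data frame means any column can be mapped to label, colour or linetype, at the cost of losing the conveniences fviz_cluster() provides.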