Tags: r, ggplot2, hierarchical-clustering, melt, facet-wrap

How to represent subclusters within clusters on variables with a line graph in R


I want to represent subclusters within clusters on variables using line graphs. I am using R.

I have two nested categorical variables: a cluster variable (values a, b, c) and a subcluster variable, such that each cluster contains several subclusters (a1, a2, a3, b1, b2, and so on).

I also have several numeric variables that I want to display by cluster and subcluster, using line graphs of their means. I have managed to display the cluster means using summarize(), melt(), and ggplot() with facet_wrap() to separate the clusters. However, I don't know how to display the subclusters.

I want to draw the cluster means with a thick black line and the subcluster means on the same graph, but greyed out and thinner to de-emphasize them. The faceting by cluster works; what I cannot figure out is how to get the subcluster means into the same panels.

I generated this dataset to illustrate the issue:

library(reshape)
library(tidyverse)

cases <- c(1:27)
cluster1 <- sort(rep(c("a","b","c"),9))
cluster2 <- sort(rep(c("a1","a2","a3","b1","b2","b3","c1","c2","c3"),3))

v1 <- runif(27,min = -2, max = 2)
v2 <- runif(27,min = -3, max = 1)
v3 <- runif(27,min = -4, max = 0)

df <- data.frame(cases,cluster1,cluster2,v1,v2,v3)

means.df <- df %>%
  group_by(cluster1) %>%
  summarise_at(vars(3:5), mean)
means.df <- as.data.frame(means.df)

melt.df <- melt(means.df,id ="cluster1")

ggplot(data = melt.df,aes(x = variable, y = value, group = cluster1))+
  geom_line()+
  geom_point()+
  ylab("Mean")+
  theme(axis.text.x = element_text(angle = 90,hjust = 1,vjust=0.3))+
  facet_wrap(facets="cluster1")

Thank you in advance. Please let me know if I can provide more details.


Solution

  • You could achieve your desired result by:

    1. Creating a data frame with the means by cluster1 and cluster2
    2. Passing this data frame to the data argument of a second geom_line()

    library(reshape)
    library(tidyverse)
    
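    # set.seed() only makes the data reproducible if it runs before
    # the runif() calls that build df in the question
    set.seed(123)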
    
    means.df <- df %>%
      group_by(cluster1) %>%
      summarise(across(starts_with("v"), mean)) %>% 
      as.data.frame()
    
    melt.df <- melt(means.df, "cluster1")
    
    means.df2 <- df %>%
      group_by(cluster1, cluster2) %>%
      summarise(across(starts_with("v"), mean))%>% 
      as.data.frame()
    
    melt.df2 <- melt(means.df2, c("cluster1", "cluster2"))
    
    # Store the plot as `base` so that labels can be added to it later
    base <- ggplot(data = melt.df, mapping = aes(x = variable, y = value, group = cluster1)) +
      geom_line(data = melt.df2, aes(group = cluster2), color = "grey", alpha = .6) +
      geom_line(color = "black") +
      geom_point() +
      ylab("Mean") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.3)) +
      facet_wrap(facets = "cluster1")
    
    base
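
For comparison, here is a minimal, self-contained sketch of the same two-layer idea without the reshape package. It assumes dplyr >= 1.0, tidyr, and ggplot2 >= 3.4 (for the linewidth argument; older versions use size instead). The names cluster_means, subcluster_means and p are made up for this illustration, and the line widths and grey shade are arbitrary choices. It rebuilds the example data with the seed set before the random draws.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(123)  # set before runif() so the data are reproducible

# Same layout as the question: 3 clusters x 3 subclusters x 3 cases
df <- data.frame(
  cases    = 1:27,
  cluster1 = rep(c("a", "b", "c"), each = 9),
  cluster2 = rep(paste0(rep(c("a", "b", "c"), each = 3), 1:3), each = 3),
  v1 = runif(27, min = -2, max = 2),
  v2 = runif(27, min = -3, max = 1),
  v3 = runif(27, min = -4, max = 0)
)

# Cluster-level means, long format: one row per cluster and variable
cluster_means <- df %>%
  group_by(cluster1) %>%
  summarise(across(starts_with("v"), mean), .groups = "drop") %>%
  pivot_longer(starts_with("v"), names_to = "variable", values_to = "value")

# Subcluster-level means, long format: one row per subcluster and variable
subcluster_means <- df %>%
  group_by(cluster1, cluster2) %>%
  summarise(across(starts_with("v"), mean), .groups = "drop") %>%
  pivot_longer(starts_with("v"), names_to = "variable", values_to = "value")

p <- ggplot(cluster_means, aes(x = variable, y = value, group = cluster1)) +
  # Thin grey lines first, so the cluster line is drawn on top of them
  geom_line(data = subcluster_means, aes(group = cluster2),
            color = "grey60", linewidth = 0.4) +
  # Thick black line for the cluster mean
  geom_line(color = "black", linewidth = 1.2) +
  geom_point() +
  ylab("Mean") +
  facet_wrap(~ cluster1)

p
```

Because both data frames contain cluster1, facet_wrap() sends each subcluster line to the panel of its parent cluster.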
    

    EDIT: To label your subclusters you could use geom_text(). Assuming you want only one label per line, I filtered the dataset for the last category mapped on x, so each label sits at the right end of its line. Here base is the plot from above stored in a variable (base <- ggplot(...) + ...).

    base +
      geom_text(data = filter(melt.df2, variable == "v3"), aes(label = cluster2), hjust = -.1, color = "black")
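
If you would rather not hard-code "v3", the last category can be looked up from the data. This is only a sketch on a small hypothetical data frame (long_means, last_var and label_df are illustrative names), assuming variable is a factor, as it should be after melt(), so that its last level is the right-most category on the x axis.

```r
library(dplyr)

# Hypothetical long-format subcluster means, shaped like melt.df2
long_means <- data.frame(
  cluster1 = rep(c("a", "b"), each = 6),
  cluster2 = rep(c("a1", "a2", "b1", "b2"), each = 3),
  variable = factor(rep(c("v1", "v2", "v3"), times = 4)),
  value    = seq(-1, 1, length.out = 12)
)

# By default a discrete x axis follows the factor level order,
# so the last level is the category drawn furthest to the right
last_var <- tail(levels(long_means$variable), 1)

# One row per subcluster: the point where each label should go
label_df <- filter(long_means, variable == last_var)
label_df
```

label_df can then be passed as the data argument of geom_text() in place of the filter() call.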
    

    However, when the line ends lie close together, geom_text() labels will overlap, so IMHO it is not the best way to add them. At least for the random example data I would suggest switching to ggrepel::geom_text_repel(), which automatically shifts the labels so that they do not overlap:

    base +
      ggrepel::geom_text_repel(data = filter(melt.df2, variable == "v3"), aes(label = cluster2), nudge_x = .25, hjust = 0, color = "black", segment.size = .25)