Tags: r, ggplot2, dplyr, sankey-diagram, ggalluvial

Alluvial plot with 2 different sources but a converging/shared variable [R]


I have experience making alluvial plots with the ggalluvial package. However, I have run into an issue while trying to create an alluvial plot with two different sources that converge onto one variable.

Here is some example data:

library(dplyr)
library(ggplot2)
library(ggalluvial)

data <- data.frame(
  unique_alluvium_entires = 1:10,
  label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
  label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
  shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)

Here is the code I use to make the plot:

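# Prep the data: freq is the number of rows sharing each shared_label value
data <- data %>%
  group_by(shared_label) %>%
  mutate(freq = n()) %>%
  ungroup()  # drop the grouping before reshaping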

data <- reshape2::melt(data, id.vars = c("unique_alluvium_entires", "freq"))
data$variable <- factor(data$variable, levels = c("label_1", "shared_label", "label_2"))
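
As an aside, the same reshaping can be sketched with tidyr::pivot_longer() in place of reshape2::melt(), since reshape2 is retired. This is only an illustration, not part of the original code: the name long_data is hypothetical, and the example data is rebuilt so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data so this sketch is self-contained
data <- data.frame(
  unique_alluvium_entires = 1:10,
  label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
  label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
  shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)

# long_data is a hypothetical name for the reshaped result
long_data <- data %>%
  group_by(shared_label) %>%
  mutate(freq = n()) %>%          # rows per shared_label value
  ungroup() %>%
  pivot_longer(
    cols = c(label_1, shared_label, label_2),
    names_to = "variable",
    values_to = "value"
  ) %>%
  # Same axis order as in the melt() version
  mutate(variable = factor(variable,
                           levels = c("label_1", "shared_label", "label_2")))
```

The rows come out in a different order than with melt(), but that should not matter for plotting, because the axis order is set by the factor levels of variable.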

#ggplot
ggplot(data,
       aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
           y = freq, fill = value, label = value)) +
  scale_x_discrete(expand = c(.1, .1)) + 
  geom_flow() +
  geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
  geom_text(stat = "stratum", size = 4) +
  theme_void() +
  theme(
   axis.text.x = element_text(size = 12, face = "bold")
  )

Resulting plot (apparently I cannot embed images yet)

As you can see, I can remove the NA values, but the shared_label column does not "stack" properly. The unique rows should stack on top of each other in the shared_label column. That would also fix the sizing issue, so that the strata are of equal size along the y axis.

Any ideas how to fix this? I have tried ggsankey, but the same issue arises and I cannot remove the NA values. Any tips are greatly appreciated!


Solution

  • This plot is the expected result of the "flow" statistical transformation, which is the default for the "flow" graphical object. (That is, geom_flow() = geom_flow(stat = "flow").) It looks like what you want is to specify the "alluvium" statistical transformation instead. I've used all of your code, but have copied and edited only the ggplot() call here.

    #ggplot
    ggplot(data,
           aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
               y = freq, fill = value, label = value)) +
      scale_x_discrete(expand = c(.1, .1)) +
      geom_flow(stat = "alluvium") +  # <-- specify alternate stat
      geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
      geom_text(stat = "stratum", size = 4) +
      theme_void() +
      theme(
        axis.text.x = element_text(size = 12, face = "bold")
      )
    #> Warning: Removed 2 rows containing missing values (geom_text).
    

    Created on 2021-12-10 by the reprex package (v2.0.1)
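
    One way to check which stat a layer ends up with is to inspect the layer object directly. This is a small sketch rather than part of the reprex above: it assumes that a ggplot2 layer exposes its stat as the stat field, and that ggalluvial's stat objects carry the class names "StatFlow" and "StatAlluvium". The variable names are hypothetical.

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical names for the two layers being compared
default_layer  <- geom_flow()                   # default: stat = "flow"
alluvium_layer <- geom_flow(stat = "alluvium")  # alternate stat

# Each layer stores the ggproto stat object it will use
uses_flow_by_default <- inherits(default_layer$stat, "StatFlow")
uses_alluvium_when_set <- inherits(alluvium_layer$stat, "StatAlluvium")

print(class(default_layer$stat))
print(class(alluvium_layer$stat))
```

    If both checks come back TRUE, the only difference between the two plots is the stat that geom_flow() is paired with.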