Sankey/alluvial diagram with percentages and partial fill in R


I would like to modify an existing Sankey plot made with ggplot2 and ggalluvial to make it more appealing.

My example is from https://corybrunson.github.io/ggalluvial/articles/ggalluvial.html

library(ggplot2)
library(ggalluvial)

data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))
ggplot(vaccinations,
       aes(x = survey, stratum = response, alluvium = subject,
           y = freq,
           fill = response, label = response)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")

Created on 2020-10-01 by the reprex package (v0.3.0)

Now I would like to change this plot so that it looks similar to a plot from https://sciolisticramblings.wordpress.com/2018/11/23/sankey-charts-the-new-pie-chart/, i.e. 1. change absolute to relative values (percentages), 2. add percentage labels, and 3. apply partial fill (e.g. for "missing" and "never").

My approach: I think I could change the axis to percentages with something like scale_y_continuous(label = scales::percent_format(scale = 100)). However, I am not sure about steps 2 and 3.


Solution

  • This could be achieved like so:

    1. Changing to percentages could be achieved by adding a new column to your data frame with the percentage shares by survey, which can then be mapped on y instead of freq.

    2. To get nice percentage labels on the axis you can make use of scale_y_continuous(labels = scales::percent_format())

    3. For the partial filling you can map e.g. response %in% c("Missing", "Never") on fill (which gives TRUE for "Missing" and "Never" and FALSE for everything else) and set the fill colors via scale_fill_manual

    4. The percentages of each stratum can be added to the label via label = paste0(after_stat(stratum), "\n", scales::percent(after_stat(count), accuracy = .1)) in geom_text, where I make use of the variables stratum and count computed by stat_stratum. The older ..stratum.. and ..count.. notation refers to the same variables but is deprecated as of ggplot2 3.4.0.

    library(ggplot2)
    library(ggalluvial)
    library(dplyr)
    
    data(vaccinations)
    levels(vaccinations$response) <- rev(levels(vaccinations$response))
    
    # Step 1: share of each row within its survey, mapped on y instead of freq
    vaccinations <- vaccinations %>% 
      group_by(survey) %>% 
      mutate(pct = freq / sum(freq)) %>% 
      ungroup()
    
    ggplot(vaccinations,
           aes(x = survey, stratum = response, alluvium = subject,
               y = pct,
               # Step 3: TRUE for the responses to highlight, FALSE otherwise
               fill = response %in% c("Missing", "Never"), 
               label = response)) +
      scale_x_discrete(expand = c(.1, .1)) +
      # Step 2: percentage labels on the y axis
      scale_y_continuous(labels = scales::percent_format()) +
      scale_fill_manual(values = c(`TRUE` = "cadetblue1", `FALSE` = "grey50")) +
      geom_flow() +
      geom_stratum(alpha = .5) +
      # Step 4: stratum name plus its share; after_stat() replaces the
      # deprecated ..stratum.. / ..count.. notation
      geom_text(aes(label = paste0(after_stat(stratum), "\n",
                                   scales::percent(after_stat(count), accuracy = .1))),
                stat = "stratum", size = 3) +
      theme(legend.position = "none") +
      ggtitle("vaccination survey responses at three points in time")
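
To see what steps 1 and 3 produce without drawing anything, here is a minimal base R sketch. The `toy` data frame and its numbers are made up for illustration and only mimic the shape of `vaccinations`. The names `highlight` and `pct_label` are likewise hypothetical helpers, not part of ggalluvial.

```r
# Toy stand-in for the vaccinations data (values are invented for illustration)
toy <- data.frame(
  survey   = rep(c("first", "second"), each = 3),
  response = rep(c("Always", "Missing", "Never"), times = 2),
  freq     = c(10, 30, 60, 20, 20, 10)
)

# Step 1: share of each row within its survey; base R counterpart of
# group_by(survey) %>% mutate(pct = freq / sum(freq))
toy$pct <- toy$freq / ave(toy$freq, toy$survey, FUN = sum)

# The shares of every survey should add up to 1
totals <- tapply(toy$pct, toy$survey, sum)

# Step 3: the logical vector that gets mapped on fill
toy$highlight <- toy$response %in% c("Missing", "Never")

# Percentage text with one decimal, similar to scales::percent(accuracy = .1)
toy$pct_label <- sprintf("%.1f%%", 100 * toy$pct)

print(toy)
```

Because `pct` sums to 1 within each survey, every column of the plot has the same total height, which is what makes the y axis readable as a percentage.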