r, sankey-diagram, ggalluvial, riverplot

Plotting a Sankey diagram with multiple stages but the same node labels in R


I would like to plot a Sankey diagram that shows how observations migrate from one risk level to another over multiple stages (in this case, years). The risk level labels are therefore the same in each year. The x-axis should show the years and the y-axis the proportion, as illustrated in the attached picture. Below is the code I attempted. Thanks!

# Sample data frame
library(ggsankeyfier)
library(dplyr)
library(ggplot2)

df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
  risk_level = c("High", "High", "High", 
                 "Low", "Low", "Very low", 
                 "Low", "Low", "Low", 
                 "Low", "Moderate", "Low", 
                 "Moderate", "High", "High"),
  Year = c(2022, 2023, 2024, 
           2022, 2023, 2024, 
           2022, 2023, 2024, 
           2022, 2023, 2024, 
           2022, 2023, 2024))

df1 <- df %>% 
  group_by(risk_level, Year) %>%
  summarise(count = n(), .groups = "drop_last") %>% 
  group_by(Year) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup() 

# Converting the data for the Sankey diagram
df_pivot <-  pivot_stages_longer(df1, stages_from = c("Year",
                                                        "risk_level"),
                                    ## the column that represents the size of the flows:
                                    values_from = "proportion")

# Attempting to plot the Sankey diagram
ggplot(df_pivot, aes(x = stage, y = proportion, group = node,
           connector = connector, edge_id = edge_id, fill = node)) +
  geom_sankeyedge(v_space = "auto") +
  geom_sankeynode(v_space = "auto")

Solution

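  • The issue is the way the data are set up. Your summary table has one row per risk level and year, and the call to pivot_stages_longer passes Year itself as a stage, so the years are treated as nodes rather than as positions along the x-axis. To achieve your desired result, reshape the data to wide format so that each year becomes its own column, then compute the count and the proportion for each unique path of risk levels across the years. Afterwards, use pivot_stages_longer to reshape the data into the long format required by ggsankeyfier: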

    library(ggsankeyfier)
    library(ggplot2)
    library(dplyr)
    library(tidyr)
    
    df_pivot <- df |>
      mutate(
        risk_level = factor(
          risk_level, c("Very low", "Low", "Moderate", "High")
        )
      ) |> 
      tidyr::pivot_wider(names_from = Year, values_from = risk_level) |> 
      count(across(-ID)) |> 
      mutate(prop = n / sum(n)) |> 
      pivot_stages_longer(
        stages_from = c("2022", "2023", "2024"),
        values_from = c("prop", "n")
      )
    
    # plot the Sankey diagram
    ggplot(df_pivot, aes(
      x = stage, y = prop, group = node,
      connector = connector, edge_id = edge_id, fill = node
    )) +
      geom_sankeyedge(v_space = "auto") +
      geom_sankeynode(v_space = "auto", order = "as_is")
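
    To see what the reshaping step produces before it is handed to pivot_stages_longer, the intermediate table can be inspected on its own. The following is only a sketch using the sample data from the question; the names paths and marginal_2022 are illustrative and not part of any package, and the year columns are listed explicitly in count(), which should be equivalent to count(across(-ID)) for this data.

    ```r
    library(dplyr)
    library(tidyr)

    # Sample data from the question: one row per ID and year
    df <- data.frame(
      ID = rep(1:5, each = 3),
      risk_level = c("High", "High", "High",
                     "Low", "Low", "Very low",
                     "Low", "Low", "Low",
                     "Low", "Moderate", "Low",
                     "Moderate", "High", "High"),
      Year = rep(2022:2024, times = 5)
    )

    # One row per unique path of risk levels across the three years,
    # with the number of IDs on that path (n) and its share (prop)
    paths <- df |>
      pivot_wider(names_from = Year, values_from = risk_level) |>
      count(`2022`, `2023`, `2024`) |>
      mutate(prop = n / sum(n))

    print(paths)

    # Summing the path proportions by the 2022 level recovers the
    # per-year proportions that the question computed directly
    marginal_2022 <- paths |>
      group_by(risk_level = `2022`) |>
      summarise(prop = sum(prop))

    print(marginal_2022)
    ```

    In this sample every ID follows a different path, so each row of paths has n equal to 1 and prop equal to 0.2. The per-year shares are still preserved: the node heights at each stage are the sums of the path proportions that pass through that node.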
    
