
How to create a sankey diagram in R showing changes over time for the same node?


I am trying to create a Sankey diagram of my data.

For each therapy, individuals are followed over time. I would like to have one node "Therapy" (a categorical variable with different therapy names) repeated over time, with the x axis representing time. Any ideas? I really appreciate any help.

This is what I have tried so far:

### install and load packages
install.packages("ggplot2")
install.packages("readxl")
install.packages("ggforce")

# load packages
library(ggplot2)
library(readxl)
library(ggforce)

### read dataset
dataset_new <- read_excel("Made_up_dataset_new.xlsx")
df_new <- as.data.frame(dataset_new)

df_new$Unit <- 1

df_sankey <- df_new[c("Therapy", "Frequency", "Continuous_time","Unit")]

# transform dataframe into appropriate format
df_sankey <- gather_set_data(df_sankey, 1:3)

# define axis-width / sep parameters once here, to be used by each geom layer in the plot
aw <- 0.1
sp <- 0.1

ggplot(df_sankey, 
       aes(x = x, id = id, split = y, value = Unit)) +
  geom_parallel_sets(aes(fill = Therapy), alpha = 0.3, 
                     axis.width = aw, sep = sp) +
  geom_parallel_sets_axes(axis.width = aw, sep = sp) +
  geom_parallel_sets_labels(colour = "white", 
                            angle = 0, size = 3,
                            axis.width = aw, sep = sp) +
  theme_minimal()

But the result is not what I want: time ends up stacked along the y axis instead of running along the x axis, if that makes sense.

I appreciate any help!


Solution

  • Well, you have several options. The first solution that worked for me was ggplot2 with geom_flow from the ggalluvial package:

    library(ggplot2)
    library(ggalluvial)
    
    # faking the data for 20 patients
    set.seed(42)
    individual <- as.character(rep(1:20,each=5))
    timeperiod <- paste0(rep(c(0, 18,36,54,72),20),"_week")
    therapy <- factor(sample(c("Etanercept", "Infliximab", "Rituximab", "Adalimumab", "Missing"), 100, replace = TRUE))
    d <- data.frame(individual, timeperiod, therapy)
    head(d)
    
    # Plotting it
    ggplot(d, aes(x = timeperiod, stratum = therapy, alluvium = individual, fill = therapy, label = therapy)) +
      scale_fill_brewer(type = "qual", palette = "Set2") +
      geom_flow(stat = "alluvium", lode.guidance = "rightleft", color = "darkgray") +
      geom_stratum() +
      theme(legend.position = "bottom") +
      ggtitle("Treatment across observation period")
    


    The argument stat = "alluvium" in geom_flow makes it possible to track individual patients. If you prefer, you can also merge the flows by leaving it out:

    ggplot(d, aes(x = timeperiod, stratum = therapy, alluvium = individual, fill = therapy, label = therapy)) +
      scale_fill_brewer(type = "qual", palette = "Set2") +
      geom_flow(color = "darkgray") +
      geom_stratum() +
      theme(legend.position = "bottom") +
      ggtitle("Treatment across observation period")
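
One caveat with the faked data above: timeperiod is a character vector, so ggplot2 orders the x axis alphabetically. That happens to be chronological for weeks 0 to 72, but it would probably break with a three-digit week. A minimal sketch, assuming a hypothetical extra time point at week 108 (not in the original data), of how an explicit factor could keep the order:

```r
# Assumption: the week labels from the faked data plus a hypothetical week 108
weeks <- c(0, 18, 36, 54, 72, 108)
labels <- paste0(weeks, "_week")

# Plain character sorting is alphabetical, so "108_week" lands before "18_week"
alphabetical <- sort(labels)

# A factor with explicit levels keeps the chronological order
timeperiod_f <- factor(labels, levels = labels)
levels(timeperiod_f)
```

With the data frame d from above, the same idea would be d$timeperiod <- factor(d$timeperiod, levels = paste0(c(0, 18, 36, 54, 72), "_week")) before plotting.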
    


    EDIT 1: If you want the flow to stop for some patients (e.g. because therapy has finished), you can do it by setting therapy to NA for these patients at the relevant time points:

    # setting 3 patients to NA at the last time point
    d$therapy[d$individual %in% c("3", "6", "9") & d$timeperiod == "72_week"] <- NA
    
    # making the plot:
    ggplot(d, aes(x = timeperiod, stratum = therapy, alluvium = individual, fill = therapy, label = therapy)) +
      scale_fill_brewer(type = "qual", palette = "Set2") +
      geom_flow(stat = "alluvium", lode.guidance = "rightleft", color = "darkgray") +
      geom_stratum(alpha = 0.75) +
      theme(legend.position = "bottom") +
      ggtitle("Treatment across observation period")
    

    Now, to be honest, networkD3 also worked, but I just didn't manage to make it look good enough.
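
For reference, here is a rough sketch of what the networkD3 route could look like. It is not the code used for the attempt mentioned above; the data frame d_net, the week column and the "therapy @ week" node labels are made up for illustration. The idea is to create one node per therapy and time point, then count how many patients move between consecutive time points:

```r
library(networkD3)

# Fake data in the same spirit as above: 20 patients, 5 time points each
set.seed(42)
weeks <- c(0, 18, 36, 54, 72)
drugs <- c("Etanercept", "Infliximab", "Rituximab", "Adalimumab", "Missing")
d_net <- data.frame(
  individual = rep(1:20, each = length(weeks)),
  week       = rep(weeks, times = 20),
  therapy    = sample(drugs, 20 * length(weeks), replace = TRUE),
  stringsAsFactors = FALSE
)

# One node per therapy and time point, e.g. "Rituximab @ 18"
d_net$node <- paste(d_net$therapy, d_net$week, sep = " @ ")
d_net <- d_net[order(d_net$individual, d_net$week), ]

# Pair each row with the same patient's next time point
n <- nrow(d_net)
same_patient <- d_net$individual[-n] == d_net$individual[-1]
steps <- data.frame(
  source = d_net$node[-n][same_patient],
  target = d_net$node[-1][same_patient],
  value  = 1,
  stringsAsFactors = FALSE
)

# Count the patients making each transition
links <- aggregate(value ~ source + target, data = steps, FUN = sum)

# sankeyNetwork() expects zero-based indices into the nodes data frame
nodes <- data.frame(
  name = unique(c(as.character(links$source), as.character(links$target))),
  stringsAsFactors = FALSE
)
links$source_id <- match(links$source, nodes$name) - 1
links$target_id <- match(links$target, nodes$name) - 1

sankey <- sankeyNetwork(
  Links = links, Nodes = nodes,
  Source = "source_id", Target = "target_id",
  Value = "value", NodeID = "name",
  fontSize = 11, nodeWidth = 20
)
sankey
```

The result is an interactive HTML widget rather than a ggplot object, so styling works differently from the ggalluvial plots.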

    EDIT 2:

    • You can also use geom_alluvium instead of geom_flow. The main (visual) difference between them is that in geom_flow the color of each flow segment is inherited from a neighboring node (either source or target), whereas in geom_alluvium it is inherited from the first node, i.e. the flow does not change color when passing through the nodes.

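    • If you want to combine the chart with another plot, note that par(mfrow = c(1, 2)) only affects base graphics and has no effect on ggplot2 objects. A package that arranges ggplot objects, such as gridExtra or patchwork, is the usual route.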
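
To see the two points of EDIT 2 together, here is a small sketch that draws the geom_flow and geom_alluvium versions side by side. It assumes the gridExtra package for the layout (one choice among several, not part of the original answer), and d_cmp is a made-up copy of the fake data. Because therapy changes within a patient, geom_alluvium may warn that the fill varies within alluvia and then use the first value.

```r
library(ggplot2)
library(ggalluvial)
library(gridExtra)  # assumption: used here only to arrange the two plots

# Fake data in the same shape as above
set.seed(42)
period_labels <- paste0(c(0, 18, 36, 54, 72), "_week")
d_cmp <- data.frame(
  individual = as.character(rep(1:20, each = 5)),
  timeperiod = factor(rep(period_labels, 20), levels = period_labels),
  therapy    = factor(sample(c("Etanercept", "Infliximab", "Rituximab",
                               "Adalimumab", "Missing"), 100, replace = TRUE))
)

# Shared mapping and styling for both variants
base <- ggplot(d_cmp, aes(x = timeperiod, stratum = therapy,
                          alluvium = individual, fill = therapy)) +
  scale_fill_brewer(type = "qual", palette = "Set2") +
  theme(legend.position = "bottom")

# Flow segments are colored per pair of neighboring time points
p_flow <- base +
  geom_flow(stat = "alluvium", color = "darkgray") +
  geom_stratum() +
  ggtitle("geom_flow")

# Each patient's alluvium keeps a single color across all time points
p_alluvium <- base +
  geom_alluvium(color = "darkgray") +
  geom_stratum() +
  ggtitle("geom_alluvium")

# Draw both plots next to each other
combined <- grid.arrange(p_flow, p_alluvium, ncol = 2)
```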