Tags: r, sankey-diagram, networkd3

Sankey diagram in R - data preparation


I have the following data frame, where each patient is a row (I am showing only a sample of it):

df = structure(list(firstY = c("N/A", "1", "3a", "3a", "3b", "1", 
"2", "1", "5", "3b"), secondY = c("N/A", "1", "2", "3a", "4", 
"1", "N/A", "1", "5", "3b"), ThirdY = c("N/A", "1", "N/A", "3b", 
"4", "1", "N/A", "1", "N/A", "3b"), FourthY = c("N/A", "1", "N/A", 
"3a", "4", "1", "N/A", "1", "N/A", "3a"), FifthY = c("N/A", "1", 
"N/A", "2", "5", "1", "N/A", "N/A", "N/A", "3b")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L))

I would like to plot a Sankey diagram that shows each patient's trajectory over time. I know that I have to create nodes and links, but I'm having trouble transforming the data into the required format. The hardest part is counting how many patients follow each trajectory: for example, how many patients went from stage 1 to stage 2 between the first and second year, and likewise for every other combination.

Any help with the data preparation would be appreciated.

The alluvial package, although simple to understand, does not cope well with large amounts of data.


Solution

  • It's not entirely clear what you'd like to achieve, because you don't mention which package you'd like to use. Looking at your data, though, this may help if you can use the alluvial package:

    library(alluvial) # sankey plots
    library(dplyr)    # data manipulation
    

    The alluvial() function accepts data in wide form like yours, but it needs a frequency column, so we create one first and then draw the plot:

    dats_all <- df %>%                                                   # data
                group_by( firstY, secondY, ThirdY, FourthY, FifthY) %>%  # group them
                summarise(Freq = n())                                    # add frequencies
    
    # now plot it
    alluvial( dats_all[,1:5], freq=dats_all$Freq, border=NA )
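
    The question also asks how many patients moved between particular stages from one year to the next. As a minimal sketch, assuming dplyr's count() is acceptable here, the pairwise counts for the first two years could be obtained as follows (the transitions name is only illustrative, and the data is the question's sample reduced to two columns):

    ```r
    library(dplyr)

    # Sample data from the question, reduced to the first two years
    df <- data.frame(
      firstY  = c("N/A", "1", "3a", "3a", "3b", "1", "2", "1", "5", "3b"),
      secondY = c("N/A", "1", "2", "3a", "4", "1", "N/A", "1", "5", "3b"),
      stringsAsFactors = FALSE
    )

    # One row per distinct firstY -> secondY transition;
    # n is the number of patients who made that transition
    transitions <- df %>% count(firstY, secondY, name = "n")
    ```

    The same call with another pair of adjacent columns gives the counts for the later years.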
    


    On the other hand, if you'd like to use a specific package, you should specify which one.


    EDIT

    Using networkD3 is a bit trickier, but you can get a nice result with it. You need links and nodes, and they have to match each other, so first we create the links:

    library(networkD3) # sankeyNetwork()

    # stack each pair of consecutive years into two columns; the _1 ... _5 suffixes
    # keep the same stage in different years as separate nodes
    links <- data.frame(source = c(paste0(df$firstY, '_1'), paste0(df$secondY, '_2'), paste0(df$ThirdY, '_3'), paste0(df$FourthY, '_4')),
                        target = c(paste0(df$secondY, '_2'), paste0(df$ThirdY, '_3'), paste0(df$FourthY, '_4'), paste0(df$FifthY, '_5')))

    # convert to character (before R 4.0, data.frame() turns strings into factors by default)
    links$source <- as.character(links$source)
    links$target <- as.character(links$target)
    

    The nodes are simply the unique values that appear in the links:

    nodes <- data.frame(name = unique(c(links$source, links$target)))
    

    Each link has to refer to its nodes by position, so we match the link names against the node names and convert them to numbers. Note the - 1 at the end: networkD3 is zero-indexed, so the indexes must start from 0.

    links$source <- match(links$source, nodes$name) - 1
    links$target <- match(links$target, nodes$name) - 1
    links$value <- 1 # every link needs a value; here each patient's step counts as 1
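
    With a value of 1 per row, the links table holds one link per patient and step. As an alternative sketch, assuming you would rather have one link per distinct transition whose width is the number of patients, the steps could be collapsed with dplyr's count() before matching. The names steps, links_agg and nodes_agg are illustrative and not part of networkD3:

    ```r
    library(dplyr)

    # Sample data from the question
    df <- data.frame(
      firstY  = c("N/A", "1", "3a", "3a", "3b", "1", "2", "1", "5", "3b"),
      secondY = c("N/A", "1", "2", "3a", "4", "1", "N/A", "1", "5", "3b"),
      ThirdY  = c("N/A", "1", "N/A", "3b", "4", "1", "N/A", "1", "N/A", "3b"),
      FourthY = c("N/A", "1", "N/A", "3a", "4", "1", "N/A", "1", "N/A", "3a"),
      FifthY  = c("N/A", "1", "N/A", "2", "5", "1", "N/A", "N/A", "N/A", "3b"),
      stringsAsFactors = FALSE
    )

    years <- names(df)

    # One row per patient and year-to-year step, with the year index as suffix
    steps <- bind_rows(lapply(seq_len(length(years) - 1), function(i) {
      data.frame(
        source = paste0(df[[years[i]]], "_", i),
        target = paste0(df[[years[i + 1]]], "_", i + 1),
        stringsAsFactors = FALSE
      )
    }))

    # Collapse identical steps; value is the number of patients per transition
    links_agg <- steps %>% count(source, target, name = "value")

    # Nodes and zero-based indexes, built the same way as above
    nodes_agg <- data.frame(name = unique(c(links_agg$source, links_agg$target)),
                            stringsAsFactors = FALSE)
    links_agg$source <- match(links_agg$source, nodes_agg$name) - 1
    links_agg$target <- match(links_agg$target, nodes_agg$name) - 1
    ```

    links_agg and nodes_agg can then be passed to sankeyNetwork() in place of links and nodes, and links_agg doubles as the table of patient counts per transition.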
    

    Now you should be ready to plot your sankey:

    sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
                  Target = 'target', Value = 'value', NodeID = 'name')
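
    Because of the paste0() suffixes, the node labels in the plot read 1_1, 3a_2 and so on. A possible tweak, sketched here on a tiny hand-made example rather than the full data, is to keep name for matching and show a separate column as the label (label is a hypothetical column name):

    ```r
    library(networkD3)

    # Tiny illustrative nodes/links in the same shape as above
    nodes <- data.frame(name = c("1_1", "3a_1", "1_2", "2_2"),
                        stringsAsFactors = FALSE)
    links <- data.frame(source = c(0, 1), target = c(2, 3), value = c(3, 1))

    # Strip the trailing _<year> suffix for display only
    nodes$label <- sub("_[0-9]+$", "", nodes$name)

    # NodeID selects the column of Nodes used for the labels
    p <- sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
                       Target = "target", Value = "value", NodeID = "label")
    ```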
    
