Tags: javascript, r, d3.js, htmlwidgets, networkd3

How does sankeyNetwork set x axis position


I am looking through the documentation and tutorials for building a Sankey plot in R via networkD3::sankeyNetwork().

I can get this working using someone else's code (from here: sankey diagram in R - data preparation - see a tidyverse way with networkd3 by CJ Yetman).

When I try to implement this myself, my nodes are placed in the wrong order on the x-axis, which makes the flow impossible to understand.

However, I cannot work out where sankeyNetwork() gets its information about x-axis position.

Here is my implementation that does not yield the desired result:

library(tidyverse)
library(networkD3)

#Create the data
df <- data.frame('one' = c('a', 'b', 'b', 'a'), 
                 'two' = c('c', 'd', 'e', 'c'), 
                 'three' = c('f', 'g', 'f', 'f'))

#My code
#Create the links
links <- df %>%
  mutate(row = row_number()) %>% #Get row for grouping and pivoting
  pivot_longer(-row) %>% #pivot to long format
  group_by(row) %>% 
  mutate(source_c = lead(value)) %>% #Get flow 
  filter(!is.na(source_c)) %>% #Get rid of NA
  rename(target_c = value) %>% #Correct names
  group_by(target_c, source_c) %>% #Count frequencies
  summarize(value = n()) %>%
  ungroup() %>%
  mutate(target = as.integer(factor(target_c)), #Convert to numeric values
         source = as.integer(factor(source_c))) %>%
  mutate(source = source - 1, #zero index
         target = target - 1) %>%
  data.frame()

#create the nodes
nodes <- data.frame(name = factor(unique(c(links$target_c, links$source_c))))

#plot the network
sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')

Yields a diagram with the nodes in the wrong order on the x-axis.

Using the working code from the linked answer:

links <-
  df %>% 
  mutate(row = row_number()) %>%  # add a row id
  gather('col', 'source', -row) %>%  # gather all columns
  mutate(col = match(col, names(df))) %>%  # convert col names to col nums
  mutate(source = paste0(source, '_', col)) %>%  # add col num to node names
  group_by(row) %>%
  arrange(col) %>%
  mutate(target = lead(source)) %>%  # get target from following node in row
  ungroup() %>% 
  filter(!is.na(target)) %>%  # remove links from last column in original data
  select(source, target) %>% 
  group_by(source, target) %>% 
  summarise(value = n())  # aggregate and count similar links

# create nodes data frame from unique nodes found in links data frame
nodes <- data.frame(id = unique(c(links$source, links$target)),
                    stringsAsFactors = FALSE)
# remove column id from names
nodes$name <- sub('_[0-9]*$', '', nodes$id)

# set links data to the 0-based index of the nodes in the nodes data frame
links$source <- match(links$source, nodes$id) - 1
links$target <- match(links$target, nodes$id) - 1

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')

Yields a working result.

I appreciate that the working code and my code are different, but I can't see where the column number (i.e. the x-axis position) is passed to sankeyNetwork(): no argument refers to a variable that contains that information. I think I can make my own data-prep code work once I know what the data needs to look like.


Solution

  • As with all functions in networkD3, sankeyNetwork() determines the x and y positions of the nodes algorithmically, based on their relationships to other nodes in the network. It does not read x and y values directly from the data.
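
    Roughly speaking, the horizontal position falls out of the link structure alone. The base-R sketch below only illustrates that idea and is not networkD3's actual layout code: it assumes a node's column is the length of the longest chain of links leading into it, and node_column is a hypothetical name.

```r
# Links taken from the df in the question, oriented left to right
links <- data.frame(source = c("a", "b", "b", "c", "d", "e"),
                    target = c("c", "d", "e", "f", "g", "f"),
                    stringsAsFactors = FALSE)

node_names <- unique(c(links$source, links$target))
node_column <- setNames(rep(0L, length(node_names)), node_names)

# Push every target at least one column to the right of its source.
# Assumption: the network is acyclic, so one pass per node is enough.
for (pass in seq_along(node_names)) {
  for (i in seq_len(nrow(links))) {
    src <- links$source[i]
    tgt <- links$target[i]
    node_column[[tgt]] <- max(node_column[[tgt]], node_column[[src]] + 1L)
  }
}

node_column
```

    Under that assumption a and b land in the first column, c, d and e in the second, and f and g in the third, without any x values in the data. The practical consequence is that the data only has to get the links right.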

    Your version does not produce the same result as the code you copied because you convert the target and source node names to factors, and then to integers, separately for each column. Each column therefore gets its own numbering, so the same node can have one id in the source column and a different id in the target column, and the resulting links are entirely different from the ones you started with.

    Take a look at your links data frame and compare it to the df data frame you started with. For instance, the first row/link in your links data frame is a->c, but your target and source columns identify it as 0->0. Likewise, the second row/link is b->d, but your target and source columns identify it as 1->1. And so forth:

    links
    #   target_c source_c value target source
    # 1        a        c     2      0      0
    # 2        b        d     1      1      1
    # 3        b        e     1      1      2
    # 4        c        f     2      2      3
    # 5        d        g     1      3      4
    # 6        e        f     1      4      3
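
    A small base-R check, using the node names from the table above, shows how two separately built factors drift apart and how a shared level set keeps them aligned. The names all_nodes, sep_* and shared_* are introduced here only for illustration.

```r
target_c <- c("a", "b", "b", "c", "d", "e")
source_c <- c("c", "d", "e", "f", "g", "f")

# Separate factors: each column is numbered against its own level set,
# so "c" is 2 in target_c but 0 in source_c
sep_target <- as.integer(factor(target_c)) - 1
sep_source <- as.integer(factor(source_c)) - 1

# Shared levels: both columns are numbered against the same node list
all_nodes <- unique(c(target_c, source_c))
shared_target <- as.integer(factor(target_c, levels = all_nodes)) - 1
shared_source <- as.integer(factor(source_c, levels = all_nodes)) - 1

# With shared levels, every id points back to the right node name
all_nodes[shared_target + 1]
all_nodes[shared_source + 1]
```

    With the shared levels, "c" gets the id 2 in both columns, which is what sankeyNetwork() needs in order to treat it as a single node.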
    

    Additionally, because you use mutate(source_c = lead(value)) instead of mutate(target = lead(source)) as in the other code you copied, you reverse the flow of your links, so you would get a mirror image of what you're expecting.
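
    The direction issue can be seen on a single row of df. This is a minimal base-R sketch in which c(path[-1], NA) stands in for dplyr::lead(); path, nxt, forward and reversed are illustrative names.

```r
path <- c("a", "c", "f")   # first row of df, read left to right
nxt  <- c(path[-1], NA)    # base-R stand-in for dplyr::lead(path)
keep <- !is.na(nxt)        # the last column has no following node

# Orientation of the copied code: current value = source, next value = target
forward <- data.frame(source = path[keep], target = nxt[keep],
                      stringsAsFactors = FALSE)

# Orientation of the question's code: current value = target, next value = source
reversed <- data.frame(source = nxt[keep], target = path[keep],
                       stringsAsFactors = FALSE)

forward    # a -> c, c -> f
reversed   # c -> a, f -> c
```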

    If you must set the target and source node ids inside the dplyr chain with mutate(), give both factor() calls the same levels, namely all unique values from both columns combined. You will still have to swap your notion of source and target to get the same result as the copied code:

    library(tidyverse)
    library(networkD3)
    
    #Create the data
    df <- data.frame('one' = c('a', 'b', 'b', 'a'), 
                     'two' = c('c', 'd', 'e', 'c'), 
                     'three' = c('f', 'g', 'f', 'f'))
    
    #My code
    #Create the links
    links <- 
      df %>%
      mutate(row = row_number()) %>% #Get row for grouping and pivoting
      pivot_longer(-row) %>% #pivot to long format
      group_by(row) %>% 
      mutate(source_c = lead(value)) %>% #Get flow 
      filter(!is.na(source_c)) %>% #Get rid of NA
      rename(target_c = value) %>% #Correct names
      group_by(target_c, source_c) %>% #Count frequencies
      summarize(value = n()) %>%
      ungroup() %>%
      mutate(target = as.integer(factor(target_c, levels = unique(c(target_c, source_c)))), #Convert to numeric values
             source = as.integer(factor(source_c, levels = unique(c(target_c, source_c))))) %>%
      mutate(source = source - 1, #zero index
             target = target - 1) %>%
      data.frame()
    
    #create the nodes
    nodes <- data.frame(name = factor(unique(c(links$target_c, links$source_c))))
    
    #plot the network
    sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
                  Target = 'target', Value = 'value', NodeID = 'name')
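
    For comparison, here is a hedged base-R sketch of the same preparation with the flow pointing left to right. It is not the code from the linked answer: it assumes, as is the case for this df, that no node name appears in more than one column (the linked answer appends the column number to avoid relying on that), and pairs is an illustrative name.

```r
df <- data.frame(one   = c("a", "b", "b", "a"),
                 two   = c("c", "d", "e", "c"),
                 three = c("f", "g", "f", "f"),
                 stringsAsFactors = FALSE)

# Pair each column with the one to its right:
# left value = source, right value = target
pairs <- do.call(rbind, lapply(seq_len(ncol(df) - 1), function(j) {
  data.frame(source_c = df[[j]], target_c = df[[j + 1]],
             stringsAsFactors = FALSE)
}))

# Count identical links
pairs$value <- 1L
links <- aggregate(value ~ source_c + target_c, data = pairs, FUN = sum)

# One shared node table; both id columns are looked up in it (0-based)
nodes <- data.frame(name = unique(c(links$source_c, links$target_c)),
                    stringsAsFactors = FALSE)
links$source <- match(links$source_c, nodes$name) - 1
links$target <- match(links$target_c, nodes$name) - 1

# Plot with the same call as above (requires networkD3):
# networkD3::sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
#                          Target = "target", Value = "value", NodeID = "name")
```

    Looking both id columns up in one node table with match() makes it impossible for them to fall out of sync, which is also how the copied code does it.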
    
