I am looking through the documentation and tutorials for building a Sankey plot with networkD3 in R via networkD3::sankeyNetwork().
I can get this working using someone else's code (from here: "sankey diagram in R - data preparation"; see the tidyverse approach with networkD3 by CJ Yetman).
When I try to implement this myself, my nodes get placed in the wrong order on the x-axis, rendering the flow impossible to understand.
However, I cannot work out where sankeyNetwork() is getting information about the x-axis location.
Here is my implementation that does not yield the desired result:
library(tidyverse)
library(networkD3)

#Create the data
df <- data.frame('one' = c('a', 'b', 'b', 'a'),
                 'two' = c('c', 'd', 'e', 'c'),
                 'three' = c('f', 'g', 'f', 'f'))

#My code
#Create the links
links <- df %>%
  mutate(row = row_number()) %>% #Get row for grouping and pivoting
  pivot_longer(-row) %>% #pivot to long format
  group_by(row) %>%
  mutate(source_c = lead(value)) %>% #Get flow
  filter(!is.na(source_c)) %>% #Get rid of NA
  rename(target_c = value) %>% #Correct names
  group_by(target_c, source_c) %>% #Count frequencies
  summarize(value = n()) %>%
  ungroup() %>%
  mutate(target = as.integer(factor(target_c)), #Convert to numeric values
         source = as.integer(factor(source_c))) %>%
  mutate(source = source - 1, #zero index
         target = target - 1) %>%
  data.frame()

#create the nodes
nodes <- data.frame(name = factor(unique(c(links$target_c, links$source_c))))

#plot the network
sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')
Yields:
Using the working code from the linked answer:
links <-
  df %>%
  mutate(row = row_number()) %>% # add a row id
  gather('col', 'source', -row) %>% # gather all columns
  mutate(col = match(col, names(df))) %>% # convert col names to col nums
  mutate(source = paste0(source, '_', col)) %>% # add col num to node names
  group_by(row) %>%
  arrange(col) %>%
  mutate(target = lead(source)) %>% # get target from following node in row
  ungroup() %>%
  filter(!is.na(target)) %>% # remove links from last column in original data
  select(source, target) %>%
  group_by(source, target) %>%
  summarise(value = n()) # aggregate and count similar links

# create nodes data frame from unique nodes found in links data frame
nodes <- data.frame(id = unique(c(links$source, links$target)),
                    stringsAsFactors = FALSE)

# remove column id from names
nodes$name <- sub('_[0-9]*$', '', nodes$id)

# set links data to the 0-based index of the nodes in the nodes data frame
links$source <- match(links$source, nodes$id) - 1
links$target <- match(links$target, nodes$id) - 1

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')
I appreciate that the working code and my code are different, but I can't see where the row number (i.e. the x-axis) data is used by sankeyNetwork(): there is no call to any variable that contains that information. I think I can make my own code work to prep the data once I know what it needs to look like.
As with all functions in networkD3, sankeyNetwork() determines the x and y positions of the nodes algorithmically, based on their relationships to other nodes in the network. It does not read x and y values directly from the data.
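To make the expected input shape concrete, here is a minimal sketch that builds the nodes and links by hand for the example data in the question. The column names source_name and target_name are only illustrative; the important part is that both index columns are 0-based positions in the same nodes data frame. The sankeyNetwork() call is guarded so the sketch still runs where networkD3 is not installed.

```r
# Nodes: one row per node; a node's 0-based row position is its id
nodes <- data.frame(name = c("a", "b", "c", "d", "e", "f", "g"),
                    stringsAsFactors = FALSE)

# Links written by name first, for readability (illustrative column names)
links <- data.frame(
  source_name = c("a", "b", "b", "c", "d", "e"),
  target_name = c("c", "d", "e", "f", "g", "f"),
  value       = c(2, 1, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Convert both name columns to 0-based indices into the SAME nodes data frame
links$source <- match(links$source_name, nodes$name) - 1
links$target <- match(links$target_name, nodes$name) - 1

# Build the widget only if networkD3 is available; print `p` to render it
if (requireNamespace("networkD3", quietly = TRUE)) {
  p <- networkD3::sankeyNetwork(Links = links, Nodes = nodes,
                                Source = "source", Target = "target",
                                Value = "value", NodeID = "name")
}
```

No x or y column appears anywhere: the left-to-right order follows from which nodes act as sources and which as targets.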
Your version does not produce the same result as the code you copied because you factor the target and source node names in the links data frame separately and then coerce each to integer. As a result, the integer assigned to a given node in the source column is not in sync with the integer assigned to it in the target column, so your links are completely different from what you started with.
Take a look at your links data frame and compare it to the df data frame you began with. For instance, the first row/link in your links data frame is a->c, but your target and source columns identify it as 0->0. Likewise, the second row/link is b->d, but your target and source columns identify it as 1->1. And so forth...
links
#   target_c source_c value target source
# 1        a        c     2      0      0
# 2        b        d     1      1      1
# 3        b        e     1      1      2
# 4        c        f     2      2      3
# 5        d        g     1      3      4
# 6        e        f     1      4      3
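The same mismatch can be reproduced in isolation with base R. This is a small sketch using the two name columns from the output above; the variable names (first, second, all_nodes and so on) are made up for illustration.

```r
# The two name columns from the links data frame above
first  <- c("a", "b", "b", "c", "d", "e")
second <- c("c", "d", "e", "f", "g", "f")

# Factored separately: each column gets its own alphabetical set of levels,
# so "c" becomes 2 in one column and 0 in the other
separate_first  <- as.integer(factor(first)) - 1   # 0 1 1 2 3 4
separate_second <- as.integer(factor(second)) - 1  # 0 1 2 3 4 3

# Factored against one shared set of levels: every node gets a single index
all_nodes <- unique(c(first, second))              # "a" "b" "c" "d" "e" "f" "g"
shared_first  <- as.integer(factor(first,  levels = all_nodes)) - 1  # 0 1 1 2 3 4
shared_second <- as.integer(factor(second, levels = all_nodes)) - 1  # 2 3 4 5 6 5
```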
Additionally, because you use mutate(source_c = lead(value)) instead of mutate(target = lead(source)) as in the code you copied, you reverse the flow of your links, so you would get a mirror image of what you're expecting.
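As a rough base R illustration of the direction issue (the names path, src and tgt are made up for this sketch): for one row of df read left to right, each node is the source and the node that follows it is the target. Within a group, lead(x) behaves like c(x[-1], NA), so the column produced by lead() holds the following node and should be treated as the target.

```r
# One row of df, read left to right
path <- c("a", "c", "f")

# Every node except the last is a source ...
src <- head(path, -1)   # "a" "c"
# ... and the node that follows it is its target
tgt <- tail(path, -1)   # "c" "f"

# Links for this row: a -> c and c -> f
row_links <- data.frame(source = src, target = tgt, stringsAsFactors = FALSE)
```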
If you must set the target and source node ids in the links data frame inside the dplyr chain with mutate() like that, you can give both factor() calls the same levels, built from all unique values in both columns. You will still have to reverse your notion of source versus target to get the same result as the copied code. For example:
library(tidyverse)
library(networkD3)

#Create the data
df <- data.frame('one' = c('a', 'b', 'b', 'a'),
                 'two' = c('c', 'd', 'e', 'c'),
                 'three' = c('f', 'g', 'f', 'f'))

#My code
#Create the links
links <-
  df %>%
  mutate(row = row_number()) %>% #Get row for grouping and pivoting
  pivot_longer(-row) %>% #pivot to long format
  group_by(row) %>%
  mutate(source_c = lead(value)) %>% #Get flow
  filter(!is.na(source_c)) %>% #Get rid of NA
  rename(target_c = value) %>% #Correct names
  group_by(target_c, source_c) %>% #Count frequencies
  summarize(value = n()) %>%
  ungroup() %>%
  #Convert to numeric values, using the same levels for both columns
  mutate(target = as.integer(factor(target_c, levels = unique(c(target_c, source_c)))),
         source = as.integer(factor(source_c, levels = unique(c(target_c, source_c))))) %>%
  mutate(source = source - 1, #zero index
         target = target - 1) %>%
  data.frame()

#create the nodes
nodes <- data.frame(name = factor(unique(c(links$target_c, links$source_c))))

#plot the network
sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')
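One way to catch this kind of mismatch before plotting is to decode the integer indices back into node names and compare them with the name columns. The helper below, links_in_sync(), is a hypothetical function written for this sketch, not part of networkD3; it assumes the links data frame still carries the source_c and target_c name columns used above.

```r
# Hypothetical helper: TRUE when every 0-based index decodes to the node
# name stored alongside it in the links data frame
links_in_sync <- function(links, nodes) {
  node_names <- as.character(nodes$name)
  all(node_names[links$source + 1] == as.character(links$source_c)) &&
    all(node_names[links$target + 1] == as.character(links$target_c))
}

nodes <- data.frame(name = c("a", "b", "c", "d", "e", "f", "g"))

# Indices produced with shared factor levels
good <- data.frame(target_c = c("a", "b", "b", "c", "d", "e"),
                   source_c = c("c", "d", "e", "f", "g", "f"),
                   target   = c(0, 1, 1, 2, 3, 4),
                   source   = c(2, 3, 4, 5, 6, 5))

# Indices produced by factoring each column separately
bad <- good
bad$source <- c(0, 1, 2, 3, 4, 3)

links_in_sync(good, nodes)  # TRUE
links_in_sync(bad, nodes)   # FALSE
```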