Search code examples
rsankey-diagram

overlaying a sankey diagram on a map


Is there a way to overlay a Sankey diagram that I had already made on R (networkD3) on a map? The Sankey is about migration, the nodes are the countries of origin, and destination and the links are the amount of people who migrate. Is it possible to overlay my Sankey on a map using R? Or are there other ways to do this? What I am trying to do is to show the flow of people geographically, overlaid on a map. this is one of the Sankeys I made using R, and the dataset I used for this is "our world in data" so making the Sankey diagram was quite simple.

flow1990_data <-
  structure(
    list(
      Year = c(1990L, 1990L, 1990L),
      Destination = c("Argentina",
                      "Argentina", "Argentina"),
      `Country of Origin` = structure(
        1:3,
        levels = c(
          "Emigrants.from.Mexico",
          "Emigrants.from.Guatemala",
          "Emigrants.from.Honduras",
          "Emigrants.from.Nicaragua",
          "Emigrants.from.Costa.Rica",
          "Emigrants.from.Panama",
          "Emigrants.from.Colombia",
          "Emigrants.from.Venezuela",
          "Emigrants.from.Ecuador",
          "Emigrants.from.Brazil",
          "Emigrants.from.Peru",
          "Emigrants.from.Bolivia",
          "Emigrants.from.Argentina",
          "Emigrants.from.Paraguay",
          "Emigrants.from.Uruguay",
          "Emigrants.from.Panama",
          "Emigrants.from.Chile",
          "Emigrants.from.El.Salvador",
          "Emigrants.from.Cuba",
          "Emigrants.from.Dominican.Republic",
          "Emigrants.from.Haiti"
        ),
        class = "factor"
      ),
      `Number of Emigrants` = c(2332L, 116L, 141L)
    ),
    row.names = c("1",
                  "23", "45"),
    class = "data.frame"
  )

I see a lot of Sankey diagrams overlaid on maps on the internet so I thought it was not impossible to do this, but I cannot find a way. I have only used R, but if it is possible with other programming languages, I would try that as well. If I have to make the Sankey diagram from the beginning in a different way, I would be happy to do that.


Solution

  • TL;DR
    You are better off creating your desired output manually using a program like Adobe Illustrator or an open source program like Inkscape. There is just too much variability to contend with. However, the method outlined below may help with this as you can save the plots as .svg files, and then edit them outside of the R environment. This would speed up the process of scaling line widths etc.

    Preamble
    There is an extensive list of issues to contend with in order to create an automated Sankey-ish diagram on a map. The reason Sankey diagrams are so powerful is that they plot on straight axes, which makes them easy to interpret. Trying to replicate this on a map where axes are not uniform makes it difficult to produce clean results. This is why, as @CJYetman alluded to in their comment, these types of maps are best created manually.

    Another complication is to do with plot units. Graphical parameters such as ggplot2's linewidth are independent of map units. For example, linewidth = 10 != 10 metres. If they were equal, it would make things somewhat easier. For this reason, it is difficult to recreate the bars that appear on the y-axes of a Sankey diagram. This example simplifies the process by having all lines converge at a single point. In addition, given the range of values in the data, some scaling of linewidth is required. I have used a scaled range of 0.1 - 10, which grossly underestimates the true range. However, this appears to be the case in your example Sankey diagram also, and scaling is a necessary compromise.

    This repex is informed by these (and many other) limitations in mind. Further issues will be discussed in turn. For simplicity, I have chosen to use a subset of the migration .csv data from Our World in Data based on your example data.

    Load required packages and create example data:

    library(rnaturalearth)
    library(sf)
    library(lwgeom)
    library(dplyr)
    library(tidyr)
    library(BAMMtools)
    library(ggplot2)
    library(ggtext)
    library(patchwork)
    
    # Get list of countries of interest from your example data
    countries <- gsub("Emigrants.from.", "", levels(flow1990_data$`Country of Origin`))
    countries <- unique(gsub("\\.", " ", countries))
    
    # Create df for country of interest (replace all instances country name in code if not Argentina)
    # Data added to working directory from https://ourworldindata.org/migration
    Argentina <- read.csv("migration-flows.csv") %>%
      filter(Year == 1990 & Country == "Argentina") %>%
      pivot_longer(-c(Year, Country)) %>%
      mutate(value = abs(value)) %>%
      extract(name, into = c("var", "Country"), regex = "([^.]+\\.[^.]+)\\.(.+)") %>%
      mutate(Country = gsub("\\.", " ", Country)) %>%
      filter(Country %in% countries & Country != "Argentina") %>%
      pivot_wider(id_cols = c(Year, Country),
                  names_from = var,
                  values_from = value)
    
    # Create world map
    world <- ne_countries() %>%
      select(c(sovereignt, admin))
    
    # Check names match if amend if necessary. If result != character(0), edit values
    # so that country names match. Otherwise, subsequent joins will not work properly.
    setdiff(countries, world$admin)
    # character(0)
    

    Now that the emigration, immigration, and world sf polygon data are loaded, create an st_centroid() sf of each country. This will be used to determine the origin and destination of each connecting geom_curve(). My preference is to use the convenient st_centroid() function and then manually correct any centroids that look 'unbalanced'. Using the alternative st_point_on_surface() function looks untidy IMHO, and in this case places the centroid for Chile somewhere near Tierra del Fuego! There are other methods for creating centroids, so you are not limited to these two options.

    # Create country centroids for deriving geom_curve() xend/yend
    sf_centroids <- st_transform(world, 3857) %>%
      st_centroid() %>%
      st_transform(4326)
    
    # Warning message:
    # st_centroid assumes attributes are constant over geometries
    
    # Correct centroids where they fall outside geometry or aren't centred well
    # Generate x/y and xend/yend coordinates for geom_curve()
    sf_sankey1 <- filter(sf_centroids, admin %in% countries) %>%
      mutate(lon = if_else(admin %in% c("Chile", "Peru"),
                           st_coordinates(.)[,1] - 1.5,
                           st_coordinates(.)[,1]),
             lat = st_coordinates(.)[,2]) %>%
      st_drop_geometry() %>%
      st_as_sf(coords = c("lon", "lat"), crs = st_crs(world)) %>%
      left_join(., Argentina, by = join_by(admin == Country)) %>%
      mutate(geometry1 = geometry[admin == "Argentina"]) %>%
      filter(!admin == "Argentina") %>%
      mutate(x1 = st_coordinates(geometry1)[,1],
             y1 = st_coordinates(geometry1)[,2],
             x2 = st_coordinates(geometry)[,1],
             y2 = st_coordinates(geometry)[,2],
             curvature_var = if_else(x1 <= x2, "east", "west"),
             diff = abs(x1 -x2))
    

    The next step is to determine how much curvature to give each curve to minimise 'clumping' and overlaps. This is where things get tricky. In this example, it seems to makes sense to increase curvature as difference in longitude between origin and destination increases. There are likely better ways to to achieve this, but no method will be 'perfect' for every scenario. As mentioned earlier, the emigration/immigration values also need to be scaled. To do this, it is easier to first create a function.

    # Create function to scale migration and diff values for linewidth and curvature
    scale_values <- function(values, min, max, new_min, new_max) {
      norm_val <- (values - min) / (max - min)
      scaled <- norm_val * (new_max - new_min) + new_min
      return(scaled)
    }
    
    # Get min and max values from Emigrates.from and Immigrants.to
    to_from <- c(na.omit(sf_sankey1$Emigrants.from), na.omit(sf_sankey1$Immigrants.to))
    
    min_val <- min(to_from)
    max_val <- max(to_from)
    
    # Set linewidth scale min/max values
    min_lwd <- 0.1
    max_lwd <- 10
    
    # Add scaled linewidth values columns to sf_sankey1
    sf_sankey1$in_scaled <- scale_values(sf_sankey1$Emigrants.from, min_val, max_val, min_lwd, max_lwd)
    sf_sankey1$out_scaled <- scale_values(sf_sankey1$Immigrants.to, min_val, max_val, min_lwd, max_lwd)
    
    # Get min and max values from diff
    min_val1 <- min(sf_sankey1$diff)
    max_val1 <- max(sf_sankey1$diff)
    
    # Set min and max curvature values
    cur_min <- 0.3
    cur_max <- 0.5
    
    # Add scaled curvature values columns to sf_sankey1
    sf_sankey1$curvature <- scale_values(sf_sankey1$diff, min_val1, max_val1, cur_min, cur_max)
    

    Because geom_curve() does not accept curvature as an aesthetic, each curve needs to be plotted individually. This is why apply() is used in the plot code. This complicates the process for creating a legend, so create a 'fake' sf with desired legend breaks which will be plotted behind the world map features. To do this, I chose Jenks natural breaks because of the wide range of values in the data. Adapt the number and values of the breaks to suit (an alternate method is included).

    # Set number of desired breaks
    num_breaks <- 7
    
    # Get coordinates for suitable location
    country <- filter(sf_centroids, admin == "Argentina")
    
    # Generate break values
    breaks <- getJenksBreaks(to_from, num_breaks)
    # breaks <- as.numeric(quantile(to_from, probs = seq(0, 1, length.out = num_breaks)))
    
    # Create sf to represent legend
    sf_legend <- data.frame(id = rep(1:num_breaks, 2),
                            lon = rep(c(st_coordinates(country)[,1],
                                        st_coordinates(country)[,1] + 0.001), 
                                      each = num_breaks),
                            lat = st_coordinates(country)[,2],
                            row.names = NULL) %>%
      st_as_sf(coords = c("lon", "lat"), crs = st_crs(world)) %>%
      summarise(geometry = st_combine(geometry), .by = id) %>%
      st_cast("LINESTRING") %>%
      mutate(flow = breaks, lwd_scaled = scale_values(flow, min_val, max_val, min_lwd, max_lwd))
    

    And finally, the plot. I have included a non-exhaustive list of variables to make tweaking the output easier.

    # Subset world using countries and return bounding box coordinates (for plot extent)
    sf_sankey2 <- filter(world, admin %in% countries)
    map_cent <- st_union(sf_sankey2) %>% st_bbox()
    
    # Set offset in map units for plot extent, and other variable for geom_curve() etc.
    offset <- 2
    angle <- 90
    ncp <- 20
    alpha_in <- 0.5
    alpha_out <- 0.4
    max_in <- max(sf_sankey1$in_scaled)
    max_out <- max(sf_sankey1$out_scaled)
    colour_in <- "#0072B2"
    colour_out <- "#920000"
    
    # Plot
    p <- ggplot() +
      geom_sf(data = world, colour = "grey70", fill = "grey99") +
      geom_sf(data = sf_sankey2, colour = "grey40", fill = "#f6fad4") +
      geom_sf(data = sf_sankey1, size = 0.5) +
      lapply(split(filter(sf_sankey1, curvature_var == "west"),
                   1:nrow(filter(sf_sankey1, curvature_var == "west"))),
             function(x) { geom_curve(data = x, aes(x = x1, y = y1, xend = x2, yend = y2),
                                      curvature = x$curvature * -1,
                                      linewidth = x$in_scaled,
                                      colour = colour_in,
                                      ncp = ncp,
                                      alpha = alpha_in,
                                      angle = angle) } ) +
      lapply(split(filter(sf_sankey1, curvature_var == "east"),
                   1:nrow(filter(sf_sankey1, curvature_var == "east"))),
             function(x) { geom_curve(data = x, aes(x = x1, y = y1, xend = x2, yend = y2),
                                      curvature = x$curvature,
                                      linewidth = x$in_scaled,
                                      colour = colour_in,
                                      ncp = ncp,
                                      alpha = alpha_in,
                                      angle = angle) } ) +
      geom_sf(data = country, colour = colour_in, size = max_in, pch = 16, alpha = alpha_in) +
      geom_sf_text(data = country,
                   label = "Immigration 1990\nArgentina",
                   size = 3,
                   nudge_x = 16,
                   nudge_y = -7,
                   fun.geometry = st_centroid,
                   colour = "black") +
      coord_sf(xlim = c(map_cent[[1]] - offset, map_cent[[3]] + offset),
               ylim = c(map_cent[[2]] - offset, map_cent[[4]] + offset),
               expand = FALSE) +
      theme_void()
    
    p + ggplot() +
      geom_sf(data = sf_legend, aes(linewidth = factor(lwd_scaled)), colour = "grey30") +
      geom_sf(data = world, colour = "grey70", fill = "grey99") +
      geom_sf(data = sf_sankey2, colour = "grey40", fill = "#f6fad4") +
      geom_sf(data = sf_sankey1, size = 0.5) +
      lapply(split(filter(sf_sankey1, curvature_var == "west"),
                   1:nrow(filter(sf_sankey1, curvature_var == "west"))),
             function(x) { geom_curve(data = x, aes(x = x2, y = y2, xend = x1, yend = y1),
                                      curvature = x$curvature,
                                      linewidth = x$out_scaled,
                                      colour = colour_out,
                                      ncp = ncp,
                                      alpha = alpha_out,
                                      angle = angle) } ) +
      lapply(split(filter(sf_sankey1, curvature_var == "east"),
                   1:nrow(filter(sf_sankey1, curvature_var == "east"))),
             function(x) { geom_curve(data = x, aes(x = x2, y = y2, xend = x1, yend = y1),
                                      curvature = x$curvature * -1,
                                      linewidth = x$out_scaled,
                                      colour = colour_out,
                                      ncp = ncp,
                                      alpha = alpha_out,
                                      angle = angle) } ) +
      scale_discrete_manual("linewidth",
                            name = paste0("<span style='font-size: 9pt'>", "Count<br>",
                                          "<span style='font-size: 8pt'>", "key = Jenks"),
                            values = sf_legend$lwd_scaled,
                            labels = scales::label_comma(accuracy = 1)(sf_legend$flow)) +
      geom_sf(data = country, colour = colour_out, size = max_out, pch = 16, alpha = alpha_out) +
      geom_sf_text(data = country,
                   label = "Emigration 1990\nArgentina",
                   size = 3,
                   nudge_x = 16,
                   nudge_y = -7,
                   fun.geometry = st_centroid,
                   colour = "black") +
      coord_sf(xlim = c(map_cent[[1]] - offset, map_cent[[3]] + offset),
               ylim = c(map_cent[[2]] - offset, map_cent[[4]] + offset),
               expand = FALSE) +
      theme_void() +
      theme(legend.title = element_markdown(),
            legend.text = element_text(size = 6, hjust = 1))
    

    result