Search code examples
riterationspatialshapefilenearest-neighbor

Combine census tracts with neighbor to reach a population threshold in R


I need to combine small geographic units (census tracts) with their neighbor with the smallest population in an iterative fashion until a population threshold (ex: >=200) is reached and the values for the combined units are summed (added).

The ultimate goal is to run an analysis of spatial autocorrelation but some units have such a small population that they should be combined with a neighbor (such as a merge in GIS that adds some of the values for the two features together). I also have some geographic units that are "islands" so have no neighbors. I'm working in R because I need code that can be re-used for multiple calculations but am also familiar with QGIS and Geoda so for "islands" with no neighbors I can manually alter the neighbors matrix to eliminate islands if needed but it would be better if instead I could write code to search for the nearest neighbor.

See sample data below of census tracts in Richmond County, NY (Staten Island) which has 7 tracts below a threshold of 200. Thanks in advance for your help!

library(tidycensus)
richmondtracts <- get_decennial(
  geography = "tract",
  state = "NY",
  county = "Richmond",
  variables = c(
    Hispanic = "P2_002N" #Hispanic population per tract
  ),
  summary_var = "P2_001N", #total population per tract
  year = 2020,
  geometry = TRUE
)

As a fairly new coder, I've been playing around with ChatGPT which has made some suggestions using while loops but they don't run. I'm a bit stuck but found this forum where geographic units are merged with a neighbor based on area: Merging small polygons with largest neighbour in R


Solution

  • TL;DR

    It may be better to manually do the merges.

    Preamble

    The following answer works in limited situations. That’s because, depending on the range and spatial distribution of values for a given ethnicity, there may be an ‘unrealistic’ number of permutations to account for in order to automate your desired outcome. To find a merge match for all tracts < the threshold in all situations requires an algorithm more complex than the one outlined below. This stands even if restricting merges to the neighbouring tract with the lowest value.

    I suggest examining the values of all ethnic groups at the same time to see if there are below threshold tracts that could be permanently merged. If that is not possible, a suggested workflow would be to first plot the tracts < the threshold and defining merge matches by ‘eyeballing’ the data and assigning merge ids manually. Then, pivot the data based on merge ids and use summarise() to merge the tracts. In essence, this is the foundation of the following answer.

    This method has only been tested on the outlined example data and is likely to fail in certain situations not limited to:

    • tracts that require > two degrees of tract separation in order to meet or exceed value threshold
    • where two tracts (x and y) match to the same neighbour (z) and the ‘tie breaker’ border length of xz and yz are identical. Although the likelihood of this happening is very low, it is not zero

    This method considers tracts with value == 0 as needing to be merged. Doing so tends to exacerbate the ecologically fallacious nature of census tract data. You will need to check how to manage this regarding your chosen analysis methods e.g. whether is it ok to include tracts with value == 0.

    For simplicity, neighbours that only share a single node/vertex are ignored. Also, there are no non-contiguous tracts in your example data so this answer includes a version of richmondtracts with two tracts identified as islands.

    The criteria of lowest value neighbour may not be the best approach given Tobler’s First Law. For instance, in the modified example data, the polygon with oid == 33 is matched with 32 and 29. Because 29 is further away than 55, it would arguably be more methodically sound to match 33 based on proximity.

    Some steps will return warnings. They can be safely ignored for the purposes of this answer. The “although coordinates are longitude/latitude, st_intersection assumes that they are planar” warning is telling you st_intersection() processes data as if were PCS and your data are defined by a GCS. For this solution, the returned results are fine. The “repeating attributes for all sub-geometries for which they may not be constant” warning is telling you that variables such as value and summary_value get repeated if for instance a MULTIPOLYGON gets split into two or more POLYGON features. In other words, do not sum value and/or summary_value if you join them back together.

    Workflow

    1. Add oid column to data as a key field for subsequent merges
    2. Create LINESTRING version of tract data to determine length of neighbour borders. In the event two or more tracts match to the same lowest value neighbour, length is used as a ‘tie break’
    3. Check for non-contiguous tracts (islands). If false, skip to step 6. If islands exist, create copies of tract data, one with only island and one with only contiguous.
    4. Find nearest neighbour of islands, assign each match a unique merge_id, and output to dataframe
    5. Join dataframe from previous step to tract data and output new df with merged polygons. No more islands
    6. Recreate LINESTRING version of new df tract data
    7. Create LINESTRING versions of polygons with value < threshold
    8. Find LINESTRINGs of neighbouring polygons, summarise() by neighbour oid, and calculate border length. Create columns to determine whether a neighbour can be considered for merging:
    • threshold: check the sum of the neighbours and below threshold tract enough to exceed threshold
    • dupe_match: return count of how many matches a neighbour has
    • dupe_rank: rank duplicates matches based on length
    • candidates: check if at least one neighbour can sum to reach threshold
    • filter() rows: if they do not sum to >= threshold only if there are other candidates; if duplicate matches do not have the longest border; and then return the match with minimum value only if a neighbour match exists
    • if no matches available, return neighbour’s neighbours and match to lowest neighbour’s neighbours value
    • arrange() so slice_head() returns lowest available value match
    • assign unique merge ids, pivot to long form, and use summarise() to merge matched tracts
    library(tidycensus)
    library(sf)
    library(dplyr)
    library(tidyr)
    library(lwgeom)
    library(ggplot2)
    
    # Original data
    richmondtracts <- get_decennial(
      geography = "tract",
      state = "NY",
      county = "Richmond",
      variables = c(
        Hispanic = "P2_002N" #Hispanic population per tract
      ),
      summary_var = "P2_001N", #total population per tract
      year = 2020,
      geometry = TRUE) %>%
      filter(!st_is_empty(.)) %>% # Remove empty geometries
      mutate(oid = 1:n()) # Add rownames as integer, oid is basis for defining merges
    
    ### Modifications for illustrative purposes, skip to step 1 if using original data
    # Modify richmondtracts to demonstrate permutations of islands and no neighbours > threshold
    richmondtracts <- richmondtracts %>%
      st_cast(., "POLYGON") %>%
      mutate(oid = 1:n())
    
    richmondtracts[95, c(1,2,4,5)] <- list("30000000000",
                              "Census Tract 000, Richmond County, New York",
                              1, 5)
    
    richmondtracts[c(9,32,60),"value"] <- c(1,2,3)
    ###
    
    # 1. Generate geometries for determining length of neighbouring polygon borders (for tie break)
    # Generate POINT version of polygon data
    p <- richmondtracts %>%
      st_cast(., "POLYGON") %>%
      st_cast(., "POINT") %>%
      st_intersection()
    
    # Generate LINESTRING version of polygon data
    l <- richmondtracts %>%
      st_cast(., "POLYGON") %>%
      st_cast(., "LINESTRING")
    
    # Split LINESTRING version to individual lines
    spl <- st_split(l, p) %>%
      st_collection_extract(., "LINESTRING") %>%
      mutate(row.id = 1:n())
    
    # 2. Merge islands to nearest nonisland
    # Get oid of all islands
    islands <- data.frame(st_intersects(richmondtracts)) %>%
      group_by(col.id) %>%
      filter(n() == 1) %>%
      ungroup() %>%
      ### NOTE: Only for modified example data, skip this line if using original data
      add_row(row.id = 35, col.id = 35)
    
    if(nrow(islands) == 0) {
      
      df <- richmondtracts
      print("No non-contiguous tracts found, skip to step 6")
      print("Optionally run step 4a for a pre-merge comparison plot")
      
    } else {
      
      print("Non-contiguous tracts found, continue on from here")
      
    }
    
    # If islands exist
    islands <- richmondtracts %>%
      filter(oid %in% islands[["row.id"]])
    
    # Get all non-islands
    nonisland <- richmondtracts %>%
      filter(!oid %in% islands[["oid"]])
    
    # 3. Find polygons in nonislands nearest to each island and assign each 'match' a unique merge_id
    islands <- islands %>%
      mutate(nearest = st_nearest_feature(islands, nonisland)) %>%
      data.frame() %>%
      group_by(oid) %>%
      mutate(merge_id = cur_group_id() + nrow(richmondtracts)) %>%
      ungroup() %>%
      select(oid, nearest, merge_id) %>%
      pivot_longer(-merge_id,
                   values_to = "oid") %>%
      select(oid, merge_id)
    
    # 4. Create new version of richmondtracts with islands merged to nearest nonislands
    df <- richmondtracts %>%
      left_join(., islands, by = "oid") %>%
      mutate(merge_id = ifelse(is.na(merge_id), oid, merge_id)) %>%
      group_by(merge_id) %>%
      summarise(GEOID = first(GEOID),
                oid = first(oid),
                value = sum(value),
                summary_value = sum(summary_value),
                .groups = "drop") %>%
      ungroup()
    
    # 4a. Return polygons with values below threshold (for plot)
    polysub <- richmondtracts %>%
      filter(as.numeric(unlist(value)) < 200) %>%
      mutate(row.id = as.integer(rownames(.)))
    
    ggplot() +
      geom_sf(data = richmondtracts,
              colour = "blue",
              fill = "grey90",
              lwd = .5,
              linetype = "dotted") +
      geom_sf(data = polysub,
              fill = "red") +
      geom_sf(data = df,
              color = "grey70",
              fill = NA,
              lwd = .5) +
      geom_sf_text(data = filter(df, !oid %in% polysub[["oid"]]),
                   aes(label = oid),
                   size = 1,
                   fun.geometry = st_centroid,
                   show.legend = FALSE) +
      geom_sf_text(data = polysub,
                   aes(label = oid),
                   size = 2,
                   fun.geometry = st_centroid,
                   show.legend = FALSE) +
      theme(panel.background = element_rect(fill = "#e3eeff", colour = "#e3eeff"),
            axis.title = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            panel.grid = element_blank(),
            plot.margin=grid::unit(c(0,0,0,0), "mm"))
    
    # 5. Merge tracts to ensure no polygon value < given threshold (e.g. 200)
    # Regenerate geometries for determining length of polygon borders based on updated tracts.
    p <- df %>%
      st_cast(., "POLYGON") %>%
      st_cast(., "POINT") %>%
      st_intersection()
    
    l <- df %>%
      st_cast(., "POLYGON") %>%
      st_cast(., "LINESTRING")
    
    spl <- st_split(l, p) %>%
      st_collection_extract(., "LINESTRING") %>%
      mutate(row.id = 1:n())
    
    # 6. Return polygons with values below threshold
    polysub <- df %>%
      filter(as.numeric(unlist(value)) < 200) %>%
      mutate(row.id = as.integer(rownames(.)))
    
    # Return LINESTRING version of polysub
    linesub <- spl %>%
      filter(.[["oid"]] %in% polysub[["oid"]]) %>%
      mutate(row.id = 1:n())
    
    # 7. Create new version of df with polygons < threshold merged to neighbour with lowest value
    df1 <- linesub %>%
      st_equals(., spl) %>%
      data.frame() %>%
      left_join(., select(data.frame(linesub), oid, row.id, value, summary_value),
                by = join_by(row.id)) %>%
      left_join(., spl, by = join_by(col.id == row.id),
                relationship = "many-to-many") %>%
      select(-col.id) %>%
      filter(!oid.x == oid.y) %>%
      st_as_sf() %>%
      group_by(oid.x, oid.y)  %>%
      summarise(value = min(value.x),
                value1 = min(value.y),
                summary_value = min(summary_value.x),
                summary_value1 = min(summary_value.y),
                .groups = "drop") %>%
      ungroup() %>%
      mutate(length_m = st_length(.)) %>%
      filter(!oid.x == oid.y) %>%
      group_by(oid.y) %>%
      mutate(threshold = ifelse((value + value1) >= 200, 1, 0),
             dupe_match = n(),
             dupe_rank = rank(length_m)) %>%
      group_by(oid.x) %>%
      mutate(candidates = any(sum(threshold) > 0)) %>%
      filter(!threshold == 0 & candidates == TRUE | candidates == FALSE) %>%
      filter(!dupe_rank < dupe_match & candidates == TRUE | candidates == FALSE) %>%
      filter(value1 == min(value1) | candidates == FALSE) %>%
      ungroup() %>%
      mutate(value1 = ifelse(candidates == FALSE,
                             value1[match(oid.y, oid.x)],
                             value1),
             oid.y = ifelse(candidates == FALSE,
                            oid.y[match(oid.y, oid.x)],
                            oid.y)) %>%
      group_by(oid.x) %>%
      arrange(oid.x, value1) %>%
      slice_head() %>%
      mutate(value1 = ifelse(candidates == FALSE, 0, value1)) %>%
      data.frame() %>%
      group_by(oid.y) %>%
      mutate(merge_id = cur_group_id() + nrow(richmondtracts)) %>%
      ungroup() %>%
      select(oid.x, oid.y, merge_id) %>%
      pivot_longer(-merge_id,
                   values_to = "oid") %>%
      select(oid, merge_id) %>%
      distinct() %>%
      left_join(if("merge_id" %in% names(df)) { select(df, -merge_id) } else { df }, .,
                by = "oid") %>%
      mutate(merge_id = ifelse(is.na(merge_id), oid, merge_id)) %>%
      group_by(merge_id) %>%
      summarise(GEOID = first(GEOID),
                oid = first(oid),
                value = sum(value),
                summary_value = sum(summary_value),
                .groups = "drop") %>%
      ungroup()
    
    ggplot() +
      geom_sf(data = df1,
              color = NA,
              fill = "grey90",
              lwd = .5) +
      geom_sf(data = filter(df1, merge_id > nrow(richmondtracts)),
              aes(fill = factor(oid)),
              lwd = .5,
              show.legend = FALSE) +
      geom_sf(data = df1,
              color = "grey70",
              fill = NA,
              lwd = .5) +
      geom_sf(data = polysub,
              color = "blue",
              fill = NA,
              lwd = .75,
              linetype = "dotted") +
      geom_sf(data = richmondtracts,
              color = "grey40",
              fill = NA,
              lwd = .4,
              linetype = "dotted") +
      geom_sf_text(data = filter(df1, merge_id > nrow(richmondtracts)),
                   aes(label = oid),
                   size = 2,
                   fun.geometry = st_centroid,
                   show.legend = FALSE) +
      geom_sf_text(data = df,
                   aes(label = oid),
                   size = 1,
                   fun.geometry = st_centroid,
                   show.legend = FALSE) +
      theme(panel.background = element_rect(fill = "#e3eeff", colour = "#e3eeff"),
            axis.title = element_blank(),
            axis.text = element_blank(),
            axis.ticks = element_blank(),
            panel.grid = element_blank(),
            plot.margin=grid::unit(c(0,0,0,0), "mm"))
    

    Modified example data result pre and post merge:

    mod data

    Original data result pre and post merge:

    original data