Search code examples
rdataframegeometryr-sf

How to create (using R's sf) a spatial data frame whose geometry column is the difference between the geometry columns of other two data frames?


I have two spatial data frames that I've loaded using the sf package in R. In the first data frame I have (61) US counties, while in the second I have a subpart of each of these (61) counties. Both data frames have the same identifying columns and the same values for these identifying columns, so for each county in the first data frame there is one, and only, corresponding subpart in the second data frame. My goal is: I would like to obtain a third spatial data frame whose geometry column is the difference between the county and its subpart.

I tried to implement the above idea basically using the following code:

library(tidyverse)
library(magrittr)
library(sf)

# Loading shapefiles
dfc <- st_read('data/counties.shp') %>%      # Shapefile with (61) US counties
  arrange(., st, cty)                        # Columns that uniquely identify observations
dfp <- st_read('data/county_parts.shp') %>%  # Shapefile with (61) US county parts
  arrange(., st, cty)                        # Columns that uniquely identify observations

# Geometry difference between each corresponding pair of observations
geom <- map(seq(nrow(dfc)), function(r) 
            st_difference(dfc[r, 'geo_cty'], dfp[r, 'geo_sub'])

(In the above code, geo_cty is the column with the county geometry and geo_sub is the county geometry subpart. Once sorted by st and cty, each row r in dfc and dfp have have the same (st, cty) values, and so the row geo_cty covers geo_sub.)

The above code ran smoothly, but I have not succeeded in creating a spatial data frame whose geometry column is the geom list that I've created. For instance, I tried running the following code:

df <- bind_cols(dfc, select(dfp, geo_sub), geom)

and I've got the following error message:

Error in `stop_vctrs()`:
! Can't recycle `..1` (size 61) to match `..58` (size 0).
Run `rlang::last_error()` to see where the error occurred.

How can I create the desired spatial data frame? What is the best way to implement the idea that I have explained?


Solution

  • I'd rather use join here to make sure geometries are correctly aligned, this would also handle cases where dataset lengths differ for whatever reason (the following example generates such input). Join creates an sf object with 2 geometry columns that can be conveniently mutated with map2() + st_difference():

    library(sf)
    #> Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
    library(dplyr, warn.conflicts = FALSE)
    library(ggplot2)
    library(patchwork)
    
    ### prepare reprex
    # sf example dataset
    nc <- 
      read_sf(system.file("shape/nc.shp", package="sf")) |> 
      select(CNTY_ID, NAME)
    
    # reduce the sample to Johnston and neighbouring polygons
    nc <- st_filter(nc, st_buffer(nc[nc$NAME == "Johnston",], 10))
    
    # generate "subregions", buffers around centroids;
    # drop few records (20%) to simulte a partially matching datasest
    set.seed(123)
    nc_sub <- nc |>
      select(-NAME) |>
      st_centroid() |>
      st_buffer(10000) |>
      slice_sample(prop = .8)
    
    # -------------------------------------------------------------------------
    # join datasets, the other sf object (left_join y arg) must be converted to 
    # tibble/data.frame first; result is an sf object with 2 geometry columns 
    nc <- nc_sub |>
      st_set_geometry("geo_sub") |>
      as_tibble() |>
      left_join(x = nc, y = _, by = "CNTY_ID")
    
    # find difference between geom columns, 
    # store results in 3rd geomerty column (geom_diff)
    nc <- nc |>
      mutate(geo_diff = purrr::map2(geometry, geo_sub, st_difference) |> 
                        st_sfc(crs = st_crs(geometry)))
    
    
    # we can change active geometry and drop other geometry columns to  get a 
    # regular single-geometry sf object, if needed:
    nc_diff <- st_set_geometry(nc, "geo_diff") |> select(!starts_with("geo"))
    

    Resulting sf object:

    nc_diff
    #> Simple feature collection with 8 features and 2 fields
    #> Geometry type: GEOMETRY
    #> Dimension:     XY
    #> Bounding box:  xmin: -79.21665 ymin: 34.55375 xmax: -77.67122 ymax: 36.26004
    #> Geodetic CRS:  NAD27
    #> # A tibble: 8 × 3
    #>   CNTY_ID NAME                                                          geo_diff
    #>     <dbl> <chr>                                                   <GEOMETRY [°]>
    #> 1    1897 Franklin MULTIPOLYGON (((-78.25455 35.81553, -78.26685 35.84838, -78.…
    #> 2    1913 Nash     POLYGON ((-78.20562 35.7254, -78.21155 35.73819, -78.23405 3…
    #> 3    1938 Wake     POLYGON ((-78.99881 35.60132, -78.93889 35.76144, -78.94444 …
    #> 4    1979 Wilson   POLYGON ((-78.1249 35.59752, -78.16245 35.68298, -78.1618 35…
    #> 5    1989 Johnston MULTIPOLYGON (((-78.53874 35.31512, -78.53947 35.33646, -78.…
    #> 6    2029 Wayne    POLYGON ((-78.16518 35.19322, -78.2574 35.22095, -78.31013 3…
    #> 7    2030 Harnett  POLYGON ((-78.71606 35.25998, -78.81239 35.25872, -78.87457 …
    #> 8    2083 Sampson  POLYGON ((-78.11374 34.69918, -78.15676 34.67715, -78.25681 …
    

    Visualize:

    g1 <- ggplot(nc) + 
      geom_sf() +
      ggtitle("regions") +
      theme_bw()
    
    g2 <- ggplot() + 
      geom_sf(data = st_union(nc), fill = NA) +
      geom_sf(data = nc_sub) +
      ggtitle("sub-regions") +
      theme_bw() +
      theme(axis.text.y = element_blank())
    
    g3 <- ggplot(nc_diff) + 
      geom_sf() +
      ggtitle("st_difference()") +
      theme_bw() +
      theme(axis.text.y = element_blank())
    
    wrap_plots(g1,g2,g3)
    

    Input dataset, a small subset of nc.shp :

    nc
    #> Simple feature collection with 8 features and 2 fields
    #> Geometry type: MULTIPOLYGON
    #> Dimension:     XY
    #> Bounding box:  xmin: -79.21665 ymin: 34.55375 xmax: -77.67122 ymax: 36.26004
    #> Geodetic CRS:  NAD27
    #> # A tibble: 8 × 3
    #>   CNTY_ID NAME                                                          geometry
    #> *   <dbl> <chr>                                               <MULTIPOLYGON [°]>
    #> 1    1897 Franklin (((-78.25455 35.81553, -78.26685 35.84838, -78.30841 35.8934…
    #> 2    1913 Nash     (((-78.18693 35.72511, -78.20562 35.7254, -78.21155 35.73819…
    #> 3    1938 Wake     (((-78.92107 35.57886, -78.99881 35.60132, -78.93889 35.7614…
    #> 4    1979 Wilson   (((-78.06533 35.58204, -78.1249 35.59752, -78.16245 35.68298…
    #> 5    1989 Johnston (((-78.53874 35.31512, -78.53947 35.33646, -78.60082 35.4030…
    #> 6    2029 Wayne    (((-78.16319 35.18229, -78.16518 35.19322, -78.2574 35.22095…
    #> 7    2030 Harnett  (((-78.61274 35.24383, -78.71606 35.25998, -78.81239 35.2587…
    #> 8    2083 Sampson  (((-78.11377 34.72099, -78.11374 34.69918, -78.15676 34.6771…
    

    nc after joining (note missing geo_sub records) and st_difference(), 3 geometry columns:

    nc
    #> Simple feature collection with 8 features and 2 fields
    #> Active geometry column: geometry
    #> Geometry type: MULTIPOLYGON
    #> Dimension:     XY
    #> Bounding box:  xmin: -79.21665 ymin: 34.55375 xmax: -77.67122 ymax: 36.26004
    #> Geodetic CRS:  NAD27
    #> # A tibble: 8 × 5
    #>   CNTY_ID NAME                                geometry                   geo_sub
    #> *   <dbl> <chr>                     <MULTIPOLYGON [°]>             <POLYGON [°]>
    #> 1    1897 Franklin (((-78.25455 35.81553, -78.26685 3…                     EMPTY
    #> 2    1913 Nash     (((-78.18693 35.72511, -78.20562 3… ((-77.92674 35.88657, -7…
    #> 3    1938 Wake     (((-78.92107 35.57886, -78.99881 3… ((-78.65973 35.87538, -7…
    #> 4    1979 Wilson   (((-78.06533 35.58204, -78.1249 35… ((-77.83458 35.6446, -77…
    #> 5    1989 Johnston (((-78.53874 35.31512, -78.53947 3…                     EMPTY
    #> 6    2029 Wayne    (((-78.16319 35.18229, -78.16518 3… ((-77.89675 35.34607, -7…
    #> 7    2030 Harnett  (((-78.61274 35.24383, -78.71606 3… ((-78.93731 35.44001, -7…
    #> 8    2083 Sampson  (((-78.11377 34.72099, -78.11374 3… ((-78.29317 35.05105, -7…
    #> # ℹ 1 more variable: geo_diff <GEOMETRY [°]>
    

    Created on 2023-11-01 with reprex v2.0.2