Search code examples
r

Unsuccessful in exporting data from 1 sf object to the boundaries of another sf object , using st_intersection


I am having trouble aggregating the data in 1 spatial polygon file at the level of an overlapping spatial polygon file. Specifically, I have a shapefile of India's 1970 districts. I ALSO have a shapefile of India's 2020 districts, with district-level soil data merged in. I read both shapefiles in using readOGR() and then turned them into sf objects using st_as_sf(). I have put that code at the BOTTOM of this post, as I assume it's probably irrelevant.

After I have both the shapefiles (soilshape.sf is the 2020 district coundaries shapefile with the soil variables, eg Zn_s is a soil variable, and dist1970.sf is the 1970 district boundaries shapefile), I can run this code below to see the spatial overlap, link to picture here.

# successful! i can map 1970 districts over 2020 districts.
ggplot(data = soilshape.sf) +
  geom_sf(aes(fill = Zn_s)) +
  geom_sf(data = dist1970.sf, lwd = .5, color = "#000000",fill = NA)

However, next I want to extract the 2020 district-level soil data to the 1970 districts... many 1970 districts overlap multiple 2020 districts, as you can see in the pic, so I want to do a weighted mean of the 2020 districts overlaid on each 1970 district, the weight being their area within the 1970 district. ChatGPT semi-helped me put together the code below.. The overlay didn't work until I put in sf_use_s2(FALSE), but maybe that's causing a problem? (See addendum; no it's not.) And I couldn't figure out how to merge the vector of data within result back into an sf object, by ID, so I merged it to the sp version of my 1970 district shapefile, and then converted that again into an sf object.

# Perform spatial overlay to find overlapping districts
sf_use_s2(FALSE)
overlay <- st_intersection(soilshape.sf, dist1970.sf)

# Calculate area of each intersection
overlay$intersection_area <- st_area(overlay)

# Calculate weighted zinc mean within each 1970 district
# Note that ID is the unique ID for each polygon in dist1970.sf
result <- overlay %>%
  dplyr::group_by(ID) %>%
  dplyr::summarize(wZn_s = sum(Zn_s * intersection_area) / sum(intersection_area))

# Join the result back to dist1970.. i can only figure out how to do with the sp object
# note: x = the shapefile (ID in dist1970), y= the data.frame (ID in result)
dist1970.zn <- merge(dist1970, result, by.x ='ID', by.y ='ID', all.x=TRUE)  # keep districts NOT in result

# sf object again
dist1970.zn.sf <- st_as_sf(dist1970.zn)

In any case, this code above runs, but the output is wrong. First of all, 105 of the 1970 districts are missing soil info (sum(is.na(dist1970.zn@data$wZn_s)) is 105, of 386 districts). That can't be right. Also, I'm unable to map the wZn_s variable. This code:

ggplot(data = dist1970.zn.sf) +
  geom_sf(aes(fill = wZn_s)) 

returns the error:

Error in `geom_sf()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `scale_type.units()`:
! Variable of class 'units' found, but 'units' package is not attached.
  Please, attach it using 'library(units)' to properly show scales with units.
Run `rlang::last_trace()` to see where the error occurred.

Help? What am I doing wrong when it comes to this spatial overlay?

Addendum: I have now tried mending a broken polygon (?) via st_make_valid() and re-running overlay <- st_intersection(soilshape.sf, dist1970.sf) with sf_use_s2(TRUE). The result is the same: 104 districts missing wZn_s data in the result object, and same error if I try to map dist1970.zn.sf. So apparently using sf_use_s2(FALSE) was NOT the problem.

CODE FOR BRINGING IN BOTH OF THE ORIGINAL SF FILES (PROB NOT RELEVANT BUT JUST IN CASE):

#==============================================================================#
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 
## 2020 shapefile + merge in the dist-level created above  =====================
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 
#==============================================================================#

# 2020 shapefile
setwd("/Users/bevis.16/Library/CloudStorage/OneDrive-TheOhioStateUniversity/BoxLeah/Projects/1 FINISHED/Kuznets/zenodo/data_raw/shpfiles")
dist2020 <- readOGR(dsn="districtshp2020", layer="output") # 20 district

# merge in district-level soil & DHS child data 
soildata <-  read.dta13("/Users/bevis.16/Library/CloudStorage/OneDrive-TheOhioStateUniversity/BoxLeah/Projects/1 FINISHED/Kuznets/distlevel.dta")

# note: x = the shapefile (distcode in dist2020), y= the data.frame (distcode in distlevel.dta)
soilshape <- merge(dist2020, soildata, by.x ='distcode', by.y ='distcode', all.x=TRUE)  # keep districts NOT in distlevel.dta
#soilshape <- merge(dist2020, soildata, by.x ='distcode', by.y ='distcode', all.x=FALSE)  # keep only districts in distlevel.dta

# make an sf objet out of dist2020 
soilshape.sf <- st_as_sf(soilshape)

#==============================================================================#
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 
## Read 1970 shapefile, Collapse to unique 1970 districts ======================
## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## 
#==============================================================================#

setwd("/Users/bevis.16/Library/CloudStorage/OneDrive-TheOhioStateUniversity/BoxLeah/Projects/GR Inequality/Data/Geospatial")
distVDS <- readOGR(dsn="Districts_For_VDS", layer="india70again") # 20 district

# Make dist1970 unique by 1970 district ID to match 1966 districts:

  # Convert SpatialPolygons to data frame
  distVDS.df <- as(distVDS, "data.frame")

  # Aggregate and sum district area (distVDS.df[, 1]) by the 1970 district code (DISTCODE)
  # (For some reason multiple polygons are labeled with 1 1970 district, so merging)
  distVDS.df.agg <- aggregate(distVDS.df[, 1], list(ID=distVDS@data$DISTCODE), sum)
  row.names(distVDS.df.agg) <- as.character(distVDS.df.agg$ID)

  # Dissolve the actual shapefile to merge together by DISTCODE
  distVDS_mrg <- unionSpatialPolygons(distVDS,ID=distVDS@data$DISTCODE)

  # Give the shapefile that merged area attribute 
  dist1970 <- SpatialPolygonsDataFrame(distVDS_mrg, distVDS.df.agg)

# make dist1970 an sf object 
dist1970.sf <- st_as_sf(dist1970)

# remove clutter 
rm(distVDS, distVDS_mrg, distVDS.df, distVDS.df.agg)

# copy coordinate system of 2020 shapefile to 1970 shapefile (had none, before)
dist1970.sf <- st_set_crs(dist1970.sf, st_crs(soilshape.sf))

Solution

  • So I've created a mock dataset here to illustrate areal-weighted interpolation using sf (you may also try using the areal package written by Dr. Chris Prener at St. Louis University, but sf should be sufficient for your use case). I've used the tidycensus package to grab some U.S. Census data at the census tract level on the foreign-born population in Dallas County, Texas. Census tracts are a great example as they can change every decennial census as the population grows or declines.

    library(sf)
    library(dplyr)
    library(tidycensus)
    library(ggplot2)
    
    dallas_foreign_2020 <- get_acs(
      geography = "tract",
      year = 2020,
      # Total foreign-born population as an example
      variables = "B05002_013",
      state = "TX",
      county = "Dallas",
      geometry = TRUE
    ) |>
      # Used crsuggest::suggest_crs() for best CRS
      st_transform(6583)
    
    dallas_foreign_2010 <- get_acs(
      geography = "tract",
      year = 2010,
      # Total foreign-born population as an example
      variables = "B05002_013",
      state = "TX",
      county = "Dallas",
      geometry = TRUE
    ) |>
      select(estimate) |>
      st_transform(6583)
    
    dallas_foreign_interpol <- st_interpolate_aw(
      dallas_foreign_2010,
      dallas_foreign_2020,
      # uses weighted sums; set to FALSE to use weighted means
      extensive = TRUE
    ) |>
      mutate(GEOID = dallas_foreign_2020$GEOID)
    
    ggplot() +
      geom_sf(
        data = dallas_foreign_interpol,
        aes(fill = estimate)
      ) +
      scale_fill_distiller(palette = "PuOr", direction = -1) +
      labs(fill = "Total Foreign-Born \nPopulation") +
      theme_void()
    

    For areal interpolation to be accurate, your two sf objects must use the same coordinate reference system (CRS). You can check the CRS of each object with sf::st_crs() and if needed, change the CRS using sf::st_transform(). Depending on the type of variable(s) you have, when using sf::st_interpolate_aw(), you must set extensive=TRUE for count-based data (e.g., population counts) or extensive=FALSE for percentage- or rate-based data (e.g., population density). Again, the areal package has a helpful tutorial and explanation on this.

    My example data is count-based, so I've used extensive areal interpolation in this scenario to estimate from 2010 geographies --> 2020 geographies. If I had used something like the % of residents per tract that are foreign-born, I would need to use intensive interpolation.

    Do note that sf produces two warnings in this scenario: 1) st_interpolate_aw assumes attributes are constant or uniform over areas of x and 2) number of items to replace is not a multiple of replacement length. The former is due to a major assumption of areal interpolation: namely, that attributes are evenly distributed over space (which is rarely the case in real life--after all, the city is much more densely populated than the suburbs). The latter is due to the increase in the number of census tracts in 2020, mainly because of the population boom Dallas County is experiencing. This may be the case with your data as well (e.g., more districts in 2020 than in 1970). All in all, the end result looks like: Choropleth of Dallas County tracts by foreign-born population