I'm trying to take a dataset of ZIP codes and limit it to just ZIP codes within Chicago. However, any way I try to do this merge captures either far too many or too few ZIP codes. Here's a reproducible example:
## Load packages
library(tigris)
library(sf)
library(ggplot2)
## Load shapefiles
ZIPs <- tigris::zctas(cb = TRUE)
ZIPs <- sf::st_as_sf(ZIPs)
places <- tigris::places(state = "17", cb = T)
chicago <- places[places$NAME == "Chicago",]
chicago <- sf::st_as_sf(chicago)
## Filter ZIPs to those within Chicago using st_intersects
overlap <- st_filter(ZIPs, chicago, .predicate = st_intersects) #Using st_intersects captures too many ZIPs
## Visualize ZIPs vs Chicago
ggplot() +
geom_sf(data = overlap, color = "black", size = 1) +
geom_sf(data = chicago, color = NA, fill = "blue", alpha = .25)
## Try again using st_within
overlap <- st_filter(ZIPs, chicago, .predicate = st_within) #Using st_within captures too few ZIPs
## Visualize ZIPs vs Chicago
ggplot() +
geom_sf(data = overlap, color = "black", size = 1) +
geom_sf(data = chicago, color = NA, fill = "blue", alpha = .25)
I've also tried to use sp::over
for this task but run into the same problem. There are clearly some ZIPs that are mostly outside of Chicago, but legitimately have some overlap (e.g. the three ZIPs on the top left of the first map). However there are others that only intersect along a border (e.g. top right), and even one that doesn't appear to intersect at all (bottom right). I want to exclude from this map any ZIPs that only intersect by border. Any advice?
Here I proposed a function that can filter the ZIPs
based on the ratio of intersected area and the original area compared to a threshold. Below is an example how to use this function. It seems like threshold = 0.3
works pretty well, but you can set any threshold based on your needs.
## Load packages
library(tigris)
library(sf)
library(ggplot2)
library(dplyr)
# A function that can filter ZIPs based on the ratio of intersected area to original area
# The default of the threshold is set to be 0.3
# If the ratio is larger than or equal to 0.3, that ZIPs would be kept
intersection_area <- function(x, y, threshold = 0.3){
z <- st_intersection(x, y)
z2 <- z %>%
mutate(Area_Inter = as.numeric(st_area(.))) %>%
select(ZCTA5CE10, Area_Inter) %>%
st_set_geometry(NULL)
x2 <- x %>%
st_filter(y, .predicate = st_intersects) %>%
mutate(Area = as.numeric(st_area(.))) %>%
select(ZCTA5CE10, Area) %>%
left_join(z2, by = "ZCTA5CE10") %>%
mutate(Area_Ratio = Area_Inter/Area) %>%
filter(Area_Ratio >= threshold)
return(x2)
}
overlap <- intersection_area(ZIPs, chicago)
## Visualize ZIPs vs Chicago
ggplot() +
geom_sf(data = overlap, color = "black", size = 1) +
geom_sf(data = chicago, color = NA, fill = "blue", alpha = .25)