What is the most elegant way to group df1 and df2 by admin region as a constraint on a near spatial join?
I have two spatial dataframes, df1 and df2. Both are from polygon shapefiles. The two dataframes have a common attribute, ADM, which is the administrative area that contains their centroids.
df1:
df1_ID year ADM geometry
9 2015 170 MULTIPOLYGON (((-1641592.97...
10 2015 170 MULTIPOLYGON (((-1641621.22...
11 2015 169 MULTIPOLYGON (((-1641621.22...
12 2011 169 MULTIPOLYGON (((-1641367.00...
13 1985 169 MULTIPOLYGON (((-1641367.00...
... ... ... ...
21 2007 170 MULTIPOLYGON (((-1641367.00...
22 2013 170 MULTIPOLYGON (((-1641367.00...
23 2013 169 MULTIPOLYGON (((-1641367.00...
24 2011 169 MULTIPOLYGON (((-1641367.00...
df2:
df2_ID settlement ADM geometry
0 7 169 MULTIPOLYGON (((-1639892.85...
1 7 170 MULTIPOLYGON (((-1645683.51...
2 4619 170 MULTIPOLYGON (((-1641531.18...
I want to join the attributes of df2 onto df1 (left join) based on their spatial proximity: the nearest df2 feature that is within 500 meters of a df1 feature is joined onto the df1 feature row. However, the function should only consider df2 features that are in the same ADM region as the df1 features in question.
Map: df1 (green), df2 (purple), and the admin boundary (orange)
Geopandas.sjoin_nearest() can easily perform the near join, but it does not have an option to run "by group." For example, using just sjoin_nearest() would produce the following result. Notice that Feature 11 from df1 joins with Feature 2 from df2, and 23 joins with 1, despite being in different admin areas.
df1_Near_df2 = sjoin_nearest(df1, df2, how='left', max_distance=500)
Sjoin_nearest() without grouping:
df1_ID year ADM_left Near_df2 ADM_right
9 2015 170 2 170
10 2015 170 2 170
11 2015 169 2 170
12 2011 169 0 169
13 1985 169 1 169
... ... ... ... ...
21 2007 170 1 170
22 2013 170 1 170
23 2013 169 1 170
24 2011 169 0 169
Desired result: sjoin_nearest() grouped by ADM:
df1_ID year ADM_left Near_df2 ADM_right
9 2015 170 2 170
10 2015 170 2 170
11 2015 169 0 169
12 2011 169 0 169
13 1985 169 0 169
... ... ... ... ...
21 2007 170 1 170
22 2013 170 1 170
23 2013 169 0 169
24 2011 169 0 169
PS: I want to complete this task in Python. That said, for reference there is an ArcGIS Pro tool called Near By Group that does exactly this. The tool worked well on the sample dataset shown here, but it takes hours and seems to error out on much larger (country-sized) datasets. https://www.arcgis.com/home/item.html?id=37dbbaa29baa467d9da8e27d87d8ad45 https://www.esri.com/arcgis-blog/products/arcgis-desktop/analytics/finding-the-nearest-feature-with-the-same-attributes/
PPS: The closest Stack Overflow thread I have found is here: How to sjoin using geopandas using a common key column and also location. I have created a new question for the following reasons:
@rob-raymond's updating sharding approach (see comments) works beautifully. It processed the country-wide version of this dataset in just minutes. One thing to keep in mind: it requires geopandas v0.11+, which in turn requires a more recent Python version. (Confirmed that Python v10 and v11 are successful). Note that if you are running your code via Anaconda, you need to update Python from within Anaconda (e.g. Anaconda Prompt command line).
Have adapted sharding approach.
df2
sjoin_nearest()
each group to appropriate shardsettlement
associated in way you wantimport geopandas as gpd
df1 = gpd.read_file("https://github.com/gracedoherty/urban_growth/blob/main/df1.zip?raw=true")
df2 = gpd.read_file("https://github.com/gracedoherty/urban_growth/blob/main/df2.zip?raw=true")
# shard the smaller dataframe into a dict
shards = {k:d for k, d in df2.groupby("ADM")}
# now just group by ADM, sjoin_nearest appropriate shard
df_n = df1.groupby("ADM").apply(lambda d: gpd.sjoin_nearest(d, shards[d["ADM"].values[0]]))
df_n.sample(5, random_state=42)
year ADM_left geometry index_right ADM_right settlement
44 1999 170 POLYGON ((-1641847.202 472181.866, -1641847.20... 1 170 7
4 2008 169 POLYGON ((-1641197.517 472577.326, -1641225.76... 0 169 7
53 2011 170 POLYGON ((-1641564.730 472153.619, -1641592.97... 1 170 7
42 2013 170 POLYGON ((-1641931.943 472181.866, -1641931.94... 1 170 7
10 2015 170 POLYGON ((-1641621.225 472436.091, -1641677.71... 2 170 4619
df_n.explore("settlement", categorical=True, height=300,width=400)
geopandas v0.11+ which in turn drops support for python 3.7 CHANGELOG