python join filter concatenation geopandas

Resolved: Geopandas sjoin_nearest() where dataframes share a common attribute

What is the most elegant way to group df1 and df2 by admin region as a constraint on a near spatial join?

Data

I have two spatial dataframes, df1 and df2. Both are from polygon shapefiles. The two dataframes have a common attribute, ADM, which is the administrative area that contains their centroids.

df1:

df1_ID year    ADM    geometry
9      2015    170    MULTIPOLYGON (((-1641592.97...
10     2015    170    MULTIPOLYGON (((-1641621.22...
11     2015    169    MULTIPOLYGON (((-1641621.22...
12     2011    169    MULTIPOLYGON (((-1641367.00...
13     1985    169    MULTIPOLYGON (((-1641367.00...
...   ...      ...    ...
21     2007    170    MULTIPOLYGON (((-1641367.00...
22     2013    170    MULTIPOLYGON (((-1641367.00...
23     2013    169    MULTIPOLYGON (((-1641367.00...
24     2011    169    MULTIPOLYGON (((-1641367.00...

df2:

df2_ID    settlement    ADM    geometry
0         7             169    MULTIPOLYGON (((-1639892.85...
1         7             170    MULTIPOLYGON (((-1645683.51...
2         4619          170    MULTIPOLYGON (((-1641531.18...

Desired spatial join output

I want to join the attributes of df2 onto df1 (left join) based on their spatial proximity: the nearest df2 feature that is within 500 meters of a df1 feature is joined onto the df1 feature row. However, the function should only consider df2 features that are in the same ADM region as the df1 features in question.

Map: df1 (green), df2 (purple), and the admin boundary (orange)

Geopandas.sjoin_nearest() can easily perform the near join, but it does not have an option to run "by group." For example, using just sjoin_nearest() would produce the following result. Notice that Feature 11 from df1 joins with Feature 2 from df2, and 23 joins with 1, despite being in different admin areas.

df1_Near_df2 = sjoin_nearest(df1, df2, how='left', max_distance=500)

Sjoin_nearest() without grouping:

df1_ID   year    ADM_left    Near_df2    ADM_right
9        2015    170         2           170
10       2015    170         2           170
11       2015    169         2           170
12       2011    169         0           169
13       1985    169         1           169
...      ...     ...         ...         ...
21       2007    170         1           170
22       2013    170         1           170
23       2013    169         1           170
24       2011    169         0           169

Desired result: sjoin_nearest() grouped by ADM:

df1_ID   year    ADM_left    Near_df2    ADM_right
9        2015    170         2           170
10       2015    170         2           170
11       2015    169         0           169
12       2011    169         0           169
13       1985    169         0           169
...      ...     ...         ...         ...
21       2007    170         1           170
22       2013    170         1           170
23       2013    169         0           169
24       2011    169         0           169

Notes

PS: I want to complete this task in Python. That said, for reference there is an ArcGIS Pro tool called Near By Group that does exactly this. The tool worked well on the sample dataset shown here, but it takes hours and seems to error out on much larger (country-sized) datasets. https://www.arcgis.com/home/item.html?id=37dbbaa29baa467d9da8e27d87d8ad45 https://www.esri.com/arcgis-blog/products/arcgis-desktop/analytics/finding-the-nearest-feature-with-the-same-attributes/

PPS: The closest Stack Overflow thread I have found is here: How to sjoin using geopandas using a common key column and also location. I have created a new question for the following reasons:

This is for sjoin_nearest() instead of sjoin().
There may be a spatial solution to my problem given that the common field for the grouping is an administrative area, whereas the linked example is grouping based on common time value.
I am providing a minimal reproducible example.

SOLUTION

@rob-raymond's updating sharding approach (see comments) works beautifully. It processed the country-wide version of this dataset in just minutes. One thing to keep in mind: it requires geopandas v0.11+, which in turn requires a more recent Python version. (Confirmed that Python v10 and v11 are successful). Note that if you are running your code via Anaconda, you need to update Python from within Anaconda (e.g. Anaconda Prompt command line).

Solution

Have adapted sharding approach.

First generate a dict of df2
then use groupby apply to sjoin_nearest() each group to appropriate shard
now have settlement associated in way you want

import geopandas as gpd


df1 = gpd.read_file("https://github.com/gracedoherty/urban_growth/blob/main/df1.zip?raw=true")
df2 = gpd.read_file("https://github.com/gracedoherty/urban_growth/blob/main/df2.zip?raw=true")

# shard the smaller dataframe into a dict
shards = {k:d for k, d in df2.groupby("ADM")}

# now just group by ADM, sjoin_nearest appropriate shard
df_n = df1.groupby("ADM").apply(lambda d: gpd.sjoin_nearest(d, shards[d["ADM"].values[0]]))

df_n.sample(5, random_state=42)

    year  ADM_left                                           geometry  index_right  ADM_right  settlement
44  1999       170  POLYGON ((-1641847.202 472181.866, -1641847.20...            1        170           7
4   2008       169  POLYGON ((-1641197.517 472577.326, -1641225.76...            0        169           7
53  2011       170  POLYGON ((-1641564.730 472153.619, -1641592.97...            1        170           7
42  2013       170  POLYGON ((-1641931.943 472181.866, -1641931.94...            1        170           7
10  2015       170  POLYGON ((-1641621.225 472436.091, -1641677.71...            2        170        4619

visualise

df_n.explore("settlement", categorical=True, height=300,width=400)

dependencies

geopandas v0.11+ which in turn drops support for python 3.7 CHANGELOG