Search code examples
pandasgeospatialgeopandas

GeoPandas spatial join - no matching rows in the resultant dataset


I have two GeoPandas DataFrames that I am attempting to join. I set the crs for both and then use sjoin and sjoin_nearest, however, I am not seeing any results. The columns are NaNs.

import pandas as pd
import geopandas as gpd

df1 = pd.DataFrame({
                    'id': [0, 1, 2], 
                    'Lat': [41.8896878, 41.8854416, 33.155480],
                    'Long': [-87.6188015, -87.615478, -96.731630]
                  })


gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1.Long, df1.Lat))

# set crs for buffer calculations
gdf1.set_crs("ESRI:102003", inplace=True)



df2 = pd.DataFrame({
                    'val': ['a', 'b', 'c'],
                    'Lat': [41.88545, 41.885507, 33.15549],
                    'Long': [-87.61762, -87.615377, -96.73164]
                  })

 
    

gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.Long, df2.Lat))

# set crs for buffer calculations
gdf2.set_crs("ESRI:102003", inplace=True)


# Spatial Join
joined_gdf = gpd.sjoin_nearest(
                               gdf1,  # Point geometry
                               gdf2,  # Point geometry
                               how='left',
                               max_distance = 0.001, # in meters
                               distance_col = "distances"
                              )

Second approach tried using sjoin,

gdf1 = gdf1.set_crs(4326, allow_override=True) 
gdf2 = gdf2.set_crs(4326, allow_override=True) 


joined_gdf = gpd.sjoin(
                        gdf1,
                        gdf2,
                        predicate = 'intersects',
                        how = 'left'
                      )

I do not see the results that I'd expect i.e no matches / rows in the resultant joined dataframe. Not sure what is going on with the spatial joins.

The columns from right table are all NaNs. The crs for both DataFrames is:

<Derived Projected CRS: ESRI:102003>
Name: USA_Contiguous_Albers_Equal_Area_Conic
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming.
- bounds: (-124.79, 24.41, -66.91, 49.38)
Coordinate Operation:
- name: USA_Contiguous_Albers_Equal_Area_Conic
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

Solution

  • There are two issues with your join:

    • You have used set_crs instead of to_crs to transform your geometries. set_crs simply declares the CRS of the geometries; to_crs actually transforms them from the declared CRS to another.
    • Additionally, your max_distance argument to sjoin is too small, so there is no valid match within 0.01 meters, and there are no results returned. Increase the max_distance parameter, e.g. to 1000, or omit it from the arguments to get a match for all points.

    set_crs vs to_crs

    The crs for your points is lat/lon (e.g. WGS84/EPSG:4326). You should set the crs as this using gdf.set_crs, then convert to the desired crs using gdf.to_crs, which actually transforms the data:

    gdf1 = (
        gdf1
        .set_crs("EPSG:4326")   # set_crs: declare that the lat/lons are in WGS84
        .to_crs("ESRI:102003")  # to_crs:  transform the geometries to ESRI:102003
    )
    

    From the geopandas.GeoDataFrame.set_crs API docs:

    GeoDataFrame.set_crs (crs=None, epsg=None, inplace=False, allow_override=False)

    Set the Coordinate Reference System (CRS) of the GeoDataFrame.

    NOTE: The underlying geometries are not transformed to this CRS. To transform the geometries to a new CRS, use the to_crs method.

    Compare with the docstring for geopandas.GeoDataFrame.to_crs:

    GeoDataFrame.to_crs (crs=None, epsg=None, inplace=False)

    Transform geometries to a new coordinate reference system.

    Transform all geometries in an active geometry column to a different coordinate reference system. The crs attribute on the current GeoSeries must be set. Either crs or epsg may be specified for output.

    When interpreted using your code, the values you provided are not within the domain of valid points for the crs you’re using (since they’re all within a hundred meters from (0, 0) lat/lon). If you change your code to correctly set the CRS as WGS84/EPSG:4326 and then transform the CRS with to_crs, they will correctly be points in the continental United States.

    max_distance is too small

    After updating to the above, note that your max_distance parameter is too stringent to allow any of the points to match. If you relax this max_distance parameter to 1000 or don't provide it, all three points will find a match.

    Note that providing a max_distance is a good idea for catching issues - you would not have found this problem so quickly if you had allowed matches at any distance! But you do need to keep in mind that a too-strict max_distance will result in a lack of matches.

    complete soluttion

    My full working runthrough of your code:

    import pandas as pd
    import geopandas as gpd
    df1 = pd.DataFrame({
                        'id': [0, 1, 2],
                        'Lat': [41.8896878, 41.8854416, 33.155480],
                        'Long': [-87.6188015, -87.615478, -96.731630]
                      })
    
    gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1.Long, df1.Lat))
    
    # set crs for buffer calculations
    gdf1 = gdf1.set_crs("epsg:4326").to_crs("ESRI:102003")
    
    df2 = pd.DataFrame({
                        'val': ['a', 'b', 'c'],
                        'Lat': [41.88545, 41.885507, 33.15549],
                        'Long': [-87.61762, -87.615377, -96.73164]
                      })
    
    gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.Long, df2.Lat))
    
    # set crs for buffer calculations
    gdf2 = gdf2.set_crs("epsg:4326").to_crs("ESRI:102003")
    
    # Spatial Join
    joined_gdf = gpd.sjoin_nearest(
        gdf1,  # Point geometry
        gdf2,  # Point geometry
        how='left',
        max_distance = 1000, # in meters
        distance_col = "distances",
    )
    

    this yields the following:

    In [2]: joined_gdf
    Out[2]:
       id   Lat_left  Long_left                        geometry  index_right val  Lat_right  Long_right   distances
    0   0  41.889688 -87.618802   POINT (689683.192 522217.641)            0   a  41.885450  -87.617620  484.004306
    1   1  41.885442 -87.615478   POINT (689997.829 521768.561)            1   b  41.885507  -87.615377   11.081472
    2   2  33.155480 -96.731630  POINT (-67811.160 -485877.116)            2   c  33.155490  -96.731640    1.450918