I have two GeoPandas DataFrames that I am attempting to join. I set the crs
for both and then use sjoin
and sjoin_nearest
, however, I am not seeing any results. The columns are NaNs
.
import pandas as pd
import geopandas as gpd
df1 = pd.DataFrame({
'id': [0, 1, 2],
'Lat': [41.8896878, 41.8854416, 33.155480],
'Long': [-87.6188015, -87.615478, -96.731630]
})
gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1.Long, df1.Lat))
# set crs for buffer calculations
gdf1.set_crs("ESRI:102003", inplace=True)
df2 = pd.DataFrame({
'val': ['a', 'b', 'c'],
'Lat': [41.88545, 41.885507, 33.15549],
'Long': [-87.61762, -87.615377, -96.73164]
})
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.Long, df2.Lat))
# set crs for buffer calculations
gdf2.set_crs("ESRI:102003", inplace=True)
# Spatial Join
joined_gdf = gpd.sjoin_nearest(
gdf1, # Point geometry
gdf2, # Point geometry
how='left',
max_distance = 0.001, # in meters
distance_col = "distances"
)
Second approach tried using sjoin
,
gdf1 = gdf1.set_crs(4326, allow_override=True)
gdf2 = gdf2.set_crs(4326, allow_override=True)
joined_gdf = gpd.sjoin(
gdf1,
gdf2,
predicate = 'intersects',
how = 'left'
)
I do not see the results that I'd expect i.e no matches / rows in the resultant joined dataframe. Not sure what is going on with the spatial joins.
The columns from right table are all NaNs
. The crs
for both DataFrames is:
<Derived Projected CRS: ESRI:102003>
Name: USA_Contiguous_Albers_Equal_Area_Conic
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: United States (USA) - CONUS onshore - Alabama; Arizona; Arkansas; California; Colorado; Connecticut; Delaware; Florida; Georgia; Idaho; Illinois; Indiana; Iowa; Kansas; Kentucky; Louisiana; Maine; Maryland; Massachusetts; Michigan; Minnesota; Mississippi; Missouri; Montana; Nebraska; Nevada; New Hampshire; New Jersey; New Mexico; New York; North Carolina; North Dakota; Ohio; Oklahoma; Oregon; Pennsylvania; Rhode Island; South Carolina; South Dakota; Tennessee; Texas; Utah; Vermont; Virginia; Washington; West Virginia; Wisconsin; Wyoming.
- bounds: (-124.79, 24.41, -66.91, 49.38)
Coordinate Operation:
- name: USA_Contiguous_Albers_Equal_Area_Conic
- method: Albers Equal Area
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
There are two issues with your join:
set_crs
instead of to_crs
to transform your geometries. set_crs
simply declares the CRS of the geometries; to_crs
actually transforms them from the declared CRS to another.max_distance
argument to sjoin
is too small, so there is no valid match within 0.01 meters, and there are no results returned. Increase the max_distance
parameter, e.g. to 1000, or omit it from the arguments to get a match for all points.set_crs
vs to_crs
The crs for your points is lat/lon (e.g. WGS84/EPSG:4326). You should set the crs as this using gdf.set_crs
, then convert to the desired crs using gdf.to_crs
, which actually transforms the data:
gdf1 = (
gdf1
.set_crs("EPSG:4326") # set_crs: declare that the lat/lons are in WGS84
.to_crs("ESRI:102003") # to_crs: transform the geometries to ESRI:102003
)
From the geopandas.GeoDataFrame.set_crs
API docs:
GeoDataFrame.set_crs
(crs=None, epsg=None, inplace=False, allow_override=False)
Set the Coordinate Reference System (CRS) of the
GeoDataFrame
.NOTE: The underlying geometries are not transformed to this CRS. To transform the geometries to a new CRS, use the
to_crs
method.
Compare with the docstring for geopandas.GeoDataFrame.to_crs
:
GeoDataFrame.to_crs
(crs=None, epsg=None, inplace=False)
Transform geometries to a new coordinate reference system.
Transform all geometries in an active geometry column to a different coordinate reference system. The
crs
attribute on the current GeoSeries must be set. Eithercrs
orepsg
may be specified for output.
When interpreted using your code, the values you provided are not within the domain of valid points for the crs you’re using (since they’re all within a hundred meters from (0, 0) lat/lon). If you change your code to correctly set the CRS as WGS84/EPSG:4326 and then transform the CRS with to_crs, they will correctly be points in the continental United States.
max_distance
is too smallAfter updating to the above, note that your max_distance
parameter is too stringent to allow any of the points to match. If you relax this max_distance parameter to 1000
or don't provide it, all three points will find a match.
Note that providing a max_distance is a good idea for catching issues - you would not have found this problem so quickly if you had allowed matches at any distance! But you do need to keep in mind that a too-strict max_distance will result in a lack of matches.
My full working runthrough of your code:
import pandas as pd
import geopandas as gpd
df1 = pd.DataFrame({
'id': [0, 1, 2],
'Lat': [41.8896878, 41.8854416, 33.155480],
'Long': [-87.6188015, -87.615478, -96.731630]
})
gdf1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1.Long, df1.Lat))
# set crs for buffer calculations
gdf1 = gdf1.set_crs("epsg:4326").to_crs("ESRI:102003")
df2 = pd.DataFrame({
'val': ['a', 'b', 'c'],
'Lat': [41.88545, 41.885507, 33.15549],
'Long': [-87.61762, -87.615377, -96.73164]
})
gdf2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.Long, df2.Lat))
# set crs for buffer calculations
gdf2 = gdf2.set_crs("epsg:4326").to_crs("ESRI:102003")
# Spatial Join
joined_gdf = gpd.sjoin_nearest(
gdf1, # Point geometry
gdf2, # Point geometry
how='left',
max_distance = 1000, # in meters
distance_col = "distances",
)
this yields the following:
In [2]: joined_gdf
Out[2]:
id Lat_left Long_left geometry index_right val Lat_right Long_right distances
0 0 41.889688 -87.618802 POINT (689683.192 522217.641) 0 a 41.885450 -87.617620 484.004306
1 1 41.885442 -87.615478 POINT (689997.829 521768.561) 1 b 41.885507 -87.615377 11.081472
2 2 33.155480 -96.731630 POINT (-67811.160 -485877.116) 2 c 33.155490 -96.731640 1.450918