I'm joining two geopandas dataframes using gpd.sjoin_nearest
. This returns the nearest point but I'm trying to work out what unit of measurement the output is in? I've done the approximate calculations in km but I don't think the projection is working.
I'm also aiming to incorporate a third geopandas dataframe (lines
) where the nearest distance between points must travel along these lines. So be interconnected.
Is it possible to import a network of connected lines (roads
) and measure the shortest distance between the same dataframes but the distance must be through the connected lines?
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
from shapely.geometry import LineString
point1 = pd.DataFrame({
'Cat': ['t1', 't2'],
'LAT': [-20, -30],
'LON': [140, 145],
})
point2 = pd.DataFrame({
'Cat': ['a', 'b'],
'LAT': [-30, -20],
'LON': [140, 145],
})
lines = pd.DataFrame({
'Cat': ['1', '1','2','2','3','3'],
'LAT': [-10, -35, -30, -30, -40, -20],
'LON': [140, 140, 130, 148, 145, 145],
})
P1_gpd = gpd.GeoDataFrame(point1, geometry = gpd.points_from_xy(point1.LON, point1.LAT, crs = 4326))
P2_gpd = gpd.GeoDataFrame(point2, geometry = gpd.points_from_xy(point2.LON, point2.LAT, crs = 4326))
lines_gpd = gpd.GeoDataFrame(lines, geometry = gpd.points_from_xy(lines.LON, lines.LAT, crs = 4326))
P1_gpd = P1_gpd.to_crs("epsg:4326")
P2_gpd = P2_gpd.to_crs("epsg:4326")
lines_gpd = lines_gpd.to_crs("epsg:4326")
roads_gpd = lines_gpd.groupby(['Cat'])['geometry'].apply(lambda x: LineString(x.tolist()))
roads_gpd = gpd.GeoDataFrame(roads_gpd, geometry='geometry')
nearest_points = gpd.sjoin_nearest(P1_gpd, P2_gpd,
distance_col="nearest_distance", lsuffix="left", rsuffix="right")
print(nearest_points)
fig, ax = plt.subplots()
P1_gpd.plot(ax = ax, markersize = 10, color = 'blue', zorder = 2)
P2_gpd.plot(ax = ax, markersize = 10, color = 'red', zorder = 2)
roads_gpd.plot(ax = ax, color = 'black')
plt.show()
The nearest point to t1
and t2
from P1_gpd
is point a
in P2_gpd
.
I'm aiming to convert the distance to km. I've done the raw calculations. The first point has a distance of 1107km and the second is 482km.
intended output:
Cat_left LAT_left LON_left geometry index_right Cat_right LAT_right LON_right nearest_distance
0 t1 -20 140 POINT (140.00000 -20.00000) 0 a -30 140 1112
1 t2 -30 145 POINT (145.00000 -30.00000) 1 a -30 140 481
Actually, this isn't just about unit of measurements, it's more of a Graph problem. You want to compute the distance (in kilometers) between the nearest points in P2(a or b) to P1(t1 and t2) along the roads (i.e, not in a free Cartesian plane).
If so, I would personally use a primal approach with momepy/networkx.
We start by building the graph using gdf_to_nx
:
PX_gpd = pd.concat([P1_gpd, P2_gpd])
spl_lines = set(
split(l, p)
for l in roads_gpd.clip(box(*PX_gpd.total_bounds)).geometry
for p in PX_gpd.geometry
)
G = momepy.gdf_to_nx(
gpd.GeoDataFrame(geometry=list(spl_lines), crs="EPSG:4326")
.explode()
.to_crs("EPSG:20354"),
approach="primal",
multigraph=False,
length="nearest_distance",
)
cat_names = (
PX_gpd.to_crs("EPSG:20354").pipe(
lambda x: x.set_index(
x.get_coordinates().agg(tuple, axis=1)
)["Cat"].to_dict()
)
)
nx.set_node_attributes(G, {n: {"name": a} for n, a in cat_names.items()})
Then, we bring back a GeoDataFrame and keep only the smallest distances between (P1<=>P2
) :
ps, ls = momepy.nx_to_gdf(G)
node_cols = ["node_start", "node_end"]
mapper_ids = ps.set_index("nodeID")["name"]
ndis = (
ls.drop(columns=node_cols)
.join(
ls[node_cols]
.replace(mapper_ids)
.apply(lambda r: pd.Series(sorted(r)), axis=1)
.set_axis(node_cols, axis=1)
)
.eval("nearest_distance = nearest_distance / 1e3")
.loc[ # comment this chain if you need to preserve self-P1-paths
lambda x: ~x[node_cols].agg(
lambda r: set(r).issubset(P1_gpd["Cat"]), axis=1
)
]
.pipe(
lambda x: x.loc[
x.groupby(
x[node_cols].isin(P1_gpd["Cat"].tolist())
.all().idxmax()
)["nearest_distance"].idxmin()
]
)
)
Output :
geometry nearest_distance node_start node_end
0 LINESTRING (395267.479 7788031.... 1107.445824 a t1
1 LINESTRING (403427.466 6680615.... 482.440476 a t2
Or, if you want the distances (i.e, path_weigth
) of all combinations of (P1<=>P2
), you can do :
from itertools import product
cat1, cat2 = [], []
for n in G.nodes:
(
cat1 if G.nodes[n]["name"]
in P1_gpd["Cat"].tolist()
else cat2
).append(n)
gdf = gpd.GeoDataFrame(
[
dict(
path=tuple(map(cat_names.get, c)),
geometry=LineString(sp),
nearest_distance=round(
nx.path_weight(G, sp, "nearest_distance") / 1e3, 2,
),
)
for c in product(cat1, cat2)
for sp in [nx.shortest_path(G, *c)]
],
crs="EPSG:20354",
).to_crs("EPSG:4326")
Output :
path geometry nearest_distance
0 (t1, a) LINESTRING (140 -20, 140 -30) 1107.45
1 (t1, b) LINESTRING (140 -20, 140 -30, 145 -30, 145 -20) 2699.42
2 (t2, a) LINESTRING (145 -30, 140 -30) 482.44
3 (t2, b) LINESTRING (145 -30, 145 -20) 1109.53
NB: Since your inputs are in lon/lat (i.e EPSG:4326), you need a projected CRS. So, feel free to identify the proper projected CRS and remember that different projected CRS distort the Earth's surface differently when flattened onto a map. This leads to slight variations in distance calculations. For example, for the specific region of Australia (in your MRE), EPSG:20354 seems to provide distance calculations that closely match your expectations.
Plot (see full code) :