Join two geopandas df's using shortest point along line network - python

I'm joining two geopandas dataframes using gpd.sjoin_nearest. This returns the nearest point but I'm trying to work out what unit of measurement the output is in? I've done the approximate calculations in km but I don't think the projection is working.

I'm also aiming to incorporate a third geopandas dataframe (lines) where the nearest distance between points must travel along these lines. So be interconnected.

Is it possible to import a network of connected lines (roads) and measure the shortest distance between the same dataframes but the distance must be through the connected lines?

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
from shapely.geometry import LineString

point1 = pd.DataFrame({
   'Cat': ['t1', 't2'],
   'LAT': [-20, -30],
   'LON': [140, 145],
   })

point2 = pd.DataFrame({
   'Cat': ['a', 'b'],
   'LAT': [-30, -20],
   'LON': [140, 145],
   })

lines = pd.DataFrame({
   'Cat': ['1', '1','2','2','3','3'],
   'LAT': [-10, -35, -30, -30, -40, -20],
   'LON': [140, 140, 130, 148, 145, 145],
   })

P1_gpd = gpd.GeoDataFrame(point1, geometry = gpd.points_from_xy(point1.LON, point1.LAT, crs = 4326))
P2_gpd = gpd.GeoDataFrame(point2, geometry = gpd.points_from_xy(point2.LON, point2.LAT, crs = 4326))
lines_gpd = gpd.GeoDataFrame(lines, geometry = gpd.points_from_xy(lines.LON, lines.LAT, crs = 4326))

P1_gpd = P1_gpd.to_crs("epsg:4326")
P2_gpd = P2_gpd.to_crs("epsg:4326")
lines_gpd = lines_gpd.to_crs("epsg:4326")

roads_gpd = lines_gpd.groupby(['Cat'])['geometry'].apply(lambda x: LineString(x.tolist()))
roads_gpd = gpd.GeoDataFrame(roads_gpd, geometry='geometry')

nearest_points = gpd.sjoin_nearest(P1_gpd, P2_gpd, 
distance_col="nearest_distance", lsuffix="left", rsuffix="right")

print(nearest_points)


fig, ax = plt.subplots()

P1_gpd.plot(ax = ax, markersize = 10, color = 'blue', zorder = 2)
P2_gpd.plot(ax = ax, markersize = 10, color = 'red', zorder = 2)
roads_gpd.plot(ax = ax, color = 'black')

plt.show()

The nearest point to t1 and t2 from P1_gpd is point a in P2_gpd.

I'm aiming to convert the distance to km. I've done the raw calculations. The first point has a distance of 1107km and the second is 482km.

intended output:

  Cat_left  LAT_left  LON_left                     geometry  index_right Cat_right  LAT_right  LON_right  nearest_distance
0       t1       -20       140  POINT (140.00000 -20.00000)            0         a        -30        140              1112
1       t2       -30       145  POINT (145.00000 -30.00000)            1         a        -30        140               481

Solution

Actually, this isn't just about unit of measurements, it's more of a Graph problem. You want to compute the distance (in kilometers) between the nearest points in P2(a or b) to P1(t1 and t2) along the roads (i.e, not in a free Cartesian plane).

If so, I would personally use a primal approach with momepy/networkx.

We start by building the graph using gdf_to_nx :

PX_gpd = pd.concat([P1_gpd, P2_gpd])

spl_lines = set(
    split(l, p)
    for l in roads_gpd.clip(box(*PX_gpd.total_bounds)).geometry
    for p in PX_gpd.geometry
)

G = momepy.gdf_to_nx(
    gpd.GeoDataFrame(geometry=list(spl_lines), crs="EPSG:4326")
        .explode()
        .to_crs("EPSG:20354"),
    approach="primal",
    multigraph=False,
    length="nearest_distance",
)

cat_names = (
    PX_gpd.to_crs("EPSG:20354").pipe(
        lambda x: x.set_index(
            x.get_coordinates().agg(tuple, axis=1)
        )["Cat"].to_dict()
    )
)

nx.set_node_attributes(G, {n: {"name": a} for n, a in cat_names.items()})

Then, we bring back a GeoDataFrame and keep only the smallest distances between (P1<=>P2) :

ps, ls = momepy.nx_to_gdf(G)
node_cols = ["node_start", "node_end"]
mapper_ids = ps.set_index("nodeID")["name"]

ndis = (
    ls.drop(columns=node_cols)
    .join(
        ls[node_cols]
        .replace(mapper_ids)
        .apply(lambda r: pd.Series(sorted(r)), axis=1)
        .set_axis(node_cols, axis=1)
    )
    .eval("nearest_distance = nearest_distance / 1e3")
    .loc[ # comment this chain if you need to preserve self-P1-paths
        lambda x: ~x[node_cols].agg(
            lambda r: set(r).issubset(P1_gpd["Cat"]), axis=1
        )
    ]
    .pipe(
        lambda x: x.loc[
            x.groupby(
                x[node_cols].isin(P1_gpd["Cat"].tolist())
                    .all().idxmax()
            )["nearest_distance"].idxmin()
        ]
    )
)

Output :

                             geometry  nearest_distance node_start node_end
0  LINESTRING (395267.479 7788031....       1107.445824          a       t1
1  LINESTRING (403427.466 6680615....        482.440476          a       t2

Or, if you want the distances (i.e, path_weigth) of all combinations of (P1<=>P2), you can do :

from itertools import product

cat1, cat2 = [], []
for n in G.nodes:
    (
        cat1 if G.nodes[n]["name"]
        in P1_gpd["Cat"].tolist()
        else cat2
    ).append(n)
    
gdf = gpd.GeoDataFrame(
    [
        dict(
            path=tuple(map(cat_names.get, c)),
            geometry=LineString(sp),
            nearest_distance=round(
                nx.path_weight(G, sp, "nearest_distance") / 1e3, 2,
            ),
        )
        for c in product(cat1, cat2)
        for sp in [nx.shortest_path(G, *c)]
    ],
    crs="EPSG:20354",
).to_crs("EPSG:4326")

Output :

      path                                         geometry  nearest_distance
0  (t1, a)                    LINESTRING (140 -20, 140 -30)           1107.45
1  (t1, b)  LINESTRING (140 -20, 140 -30, 145 -30, 145 -20)           2699.42
2  (t2, a)                    LINESTRING (145 -30, 140 -30)            482.44
3  (t2, b)                    LINESTRING (145 -30, 145 -20)           1109.53

NB: Since your inputs are in lon/lat (i.e EPSG:4326), you need a projected CRS. So, feel free to identify the proper projected CRS and remember that different projected CRS distort the Earth's surface differently when flattened onto a map. This leads to slight variations in distance calculations. For example, for the specific region of Australia (in your MRE), EPSG:20354 seems to provide distance calculations that closely match your expectations.

Plot (see full code) :