How to efficiently calculate the distance between two geopandas geometries

I have two geopandas data frames, one with point geometries and one with line geometries, and I am calculating the distance between them. For each point I calculate the distance to its reference line, whose id is stored in a column of the point data frame. The distance is calculated for 321,113 point features.

I'm using a list comprehension, but it still takes a lot of time. That is far too slow, because I will need to do this for even bigger data sets with many more point features. My code so far is as follows:

def get_distance(point_lineID, point_FID, point_GEOM, lines_df, points_df):

    ref_line = lines_df.loc[lines_df["line_id"] == point_lineID]

    try:
        d = point_GEOM.distance(ref_line["geometry"]).values[0]
        
    except IndexError:
        d = -99
        

    # Add value to frame
    row_num = points_df[points_df["point_id"] == point_FID].index
    points_df.loc[row_num, "distance_mp"] = d


result = [
    get_distance(point_lineid, point_fid, point_geom, df_lines, df_points)
    for point_lineid, point_fid, point_geom in zip(
        points["line_id"], points["point_id"], points["geometry"]
    )
]

How can I make this more performant? An answer with some explanation would be much appreciated.

Solution

There are several ways to potentially make the code more performant. Here are a few suggestions:

Use apply instead of the list comprehension: Note that apply with axis=1 is not truly vectorized; it still calls a Python function once per row. What it does remove is the second lookup and the row-by-row write into points_df, because the returned values are assigned as a whole column in one step:

def get_distance(row, lines_df):
    # Select the reference line for this point by its id
    ref_line = lines_df.loc[lines_df["line_id"] == row["line_id"]]
    try:
        # GeoSeries.distance returns one value per selected line; take the first
        return ref_line["geometry"].distance(row["geometry"]).values[0]
    except IndexError:
        # No line with this id exists
        return -99

points["distance_mp"] = points.apply(lambda row: get_distance(row, df_lines), axis=1)
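The per-row lookup can be avoided altogether by lining up one line geometry per point and calling GeoSeries.distance once with align=False, which computes all distances element-wise in compiled code. The following is a minimal sketch, assuming line_id is unique in the lines frame and both frames share the same projected CRS; the function name distance_to_reference_line and the toy data are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Toy data standing in for the real frames (hypothetical values)
df_lines = gpd.GeoDataFrame(
    {"line_id": [1, 2]},
    geometry=[LineString([(0, 0), (10, 0)]), LineString([(0, 5), (10, 5)])],
)
points = gpd.GeoDataFrame(
    {"point_id": [10, 11, 12], "line_id": [1, 2, 99]},
    geometry=[Point(3, 4), Point(1, 7), Point(0, 0)],
)


def distance_to_reference_line(points, lines, missing=-99.0):
    """Distance from each point to the line named in its line_id column."""
    # GeoSeries of line geometries, indexed by line_id (ids must be unique)
    line_geoms = lines.set_index("line_id")["geometry"]
    # One line geometry per point, in point order; unknown ids become None
    matched = line_geoms.reindex(points["line_id"])
    # Row-by-row distance without index alignment
    dist = points.geometry.distance(matched, align=False)
    # Points without a matching line get the sentinel value
    return dist.fillna(missing)


result = distance_to_reference_line(points, df_lines)
points["distance_mp"] = result
```

The whole computation is a single reindex plus a single distance call, so the cost no longer grows with the number of points times the number of lines.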

Use spatial indexing: A spatial index (such as an R-tree) speeds up queries in which the matching geometry has to be searched for, for example the nearest line to each point. Be aware that this answers a slightly different question: it returns the closest line of any id, which is not necessarily the line referenced in the point's line_id column. geopandas builds the index lazily and exposes it as the sindex attribute of a GeoSeries or GeoDataFrame. With geopandas 0.10 or later (and shapely 2.0 or PyGEOS installed), its nearest method accepts all points at once:

import pandas as pd

# Spatial index of the lines, built on first access and then cached
sindex = df_lines.sindex

# Query all points at once. The result holds the positions of each point and
# of its nearest line, plus the distance between the two.
(point_pos, line_pos), dist = sindex.nearest(
    points.geometry, return_all=False, return_distance=True
)

# Points with a missing or empty geometry are not returned and stay NaN
points["distance_mp"] = pd.Series(dist, index=points.index[point_pos])
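If the nearest line of any id is really what is wanted, geopandas 0.10 and later also provide sjoin_nearest, which uses the spatial index internally and returns the attributes of the nearest line together with the distance. A minimal sketch with made-up toy data (the frame names lines and pts are only for illustration):

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Hypothetical toy data: two horizontal lines and two points
lines = gpd.GeoDataFrame(
    {"line_id": [1, 2]},
    geometry=[LineString([(0, 0), (10, 0)]), LineString([(0, 5), (10, 5)])],
)
pts = gpd.GeoDataFrame(
    {"point_id": [10, 11]},
    geometry=[Point(3, 1), Point(1, 7)],
)

# For every point, attach the nearest line and store the distance in a new column
nearest = gpd.sjoin_nearest(pts, lines, how="left", distance_col="distance_mp")
```

Points that are equally close to several lines produce one output row per tie, so the result can have more rows than the input.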

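Use Cython or Numba: These tools compile Python code to C code or machine code, respectively, but they only pay off when the work runs on plain NumPy arrays. Numba's nopython mode cannot compile code that touches a DataFrame or calls methods on shapely geometry objects, and shapely's own distance already runs in compiled GEOS code. If you still want a compiled kernel, extract the coordinates first and compute the point-to-segment distances yourself. Here's an example using Numba (it assumes planar coordinates, single-part LineStrings, unique line_id values and shapely 2.0 or later):

import numba as nb
import numpy as np
import pandas as pd
import shapely

@nb.njit(parallel=True)
def distances_to_lines(px, py, line_pos, coords, offsets):
    # line_pos[i] is the row position of point i's line (-1 if there is none);
    # the vertices of line k are coords[offsets[k]:offsets[k + 1]]
    out = np.full(px.shape[0], -99.0)
    for i in nb.prange(px.shape[0]):
        k = line_pos[i]
        if k >= 0:
            min_dist = np.inf
            for j in range(offsets[k], offsets[k + 1] - 1):
                ax, ay = coords[j, 0], coords[j, 1]
                dx, dy = coords[j + 1, 0] - ax, coords[j + 1, 1] - ay
                seg_len2 = dx * dx + dy * dy
                t = 0.0
                if seg_len2 > 0.0:
                    # Projection of the point onto the segment, clamped to its ends
                    t = ((px[i] - ax) * dx + (py[i] - ay) * dy) / seg_len2
                    t = min(1.0, max(0.0, t))
                dist = np.hypot(px[i] - (ax + t * dx), py[i] - (ay + t * dy))
                if dist < min_dist:
                    min_dist = dist
            out[i] = min_dist
    return out

# Flatten the line vertices into plain arrays
coords, part = shapely.get_coordinates(df_lines.geometry.to_numpy(), return_index=True)
offsets = np.concatenate(([0], np.cumsum(np.bincount(part, minlength=len(df_lines)))))

# Translate each point's line_id into a row position in df_lines (-1 = no such line)
line_pos = pd.Index(df_lines["line_id"]).get_indexer(points["line_id"])

points["distance_mp"] = distances_to_lines(
    points.geometry.x.to_numpy(), points.geometry.y.to_numpy(), line_pos, coords, offsets
)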
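Any compiled kernel that works on raw coordinates has to reproduce the point-to-segment distance that shapely computes internally. As a reference for checking such a kernel, here is a minimal plain NumPy sketch of that formula; the function name point_segment_distance is made up for illustration, and planar (projected) coordinates are assumed.

```python
import numpy as np


def point_segment_distance(px, py, ax, ay, bx, by):
    """Element-wise distance from points (px, py) to segments (ax, ay)-(bx, by)."""
    px, py, ax, ay, bx, by = (
        np.asarray(v, dtype=float) for v in (px, py, ax, ay, bx, by)
    )
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    # Avoid dividing by zero for zero-length segments
    safe_len2 = np.where(seg_len2 > 0, seg_len2, 1.0)
    # Position of the projected point along the segment, clamped to [0, 1]
    t = np.clip(((px - ax) * dx + (py - ay) * dy) / safe_len2, 0.0, 1.0)
    # A zero-length segment is a single point: measure to its start vertex
    t = np.where(seg_len2 > 0, t, 0.0)
    return np.hypot(px - (ax + t * dx), py - (ay + t * dy))


# Two points against the segment (0, 0)-(10, 0): one above it, one left of its start
example = point_segment_distance([3, -3], [4, 4], [0, 0], [0, 0], [10, 10], [0, 0])
```

The distance to a polyline is the minimum of this value over all of its segments.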