Search code examples
pythonpandasperformancegeospatialgeopandas

Faster algorithm to calculate similarity between points in space


I am trying to calculate some kind of similarity between points with coordinates in geographical space. I will use an example to make things a bit more clear:

import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point

df = pd.DataFrame({
    'Name':['a','b','c','d'],
    'Value':[1,2,3,4],
    'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
  Name  Value                 geometry
0    a      1  POINT (1.00000 0.00000)
1    b      2  POINT (1.00000 2.00000)
2    c      3  POINT (1.00000 0.00000)
3    d      4  POINT (3.00000 3.00000)

I need a new dataframe containing the distance between each pair of points, their similarity (Manhattan distance in this case) and also their other possible variables (in this case there is only name as additional variable).

My solution is the following:

def calc_values_for_row(row, sourcepoint):  ## sourcepoint is a row of tdf
    sourcename = sourcepoint['Name']
    targetname = row['Name']
    manhattan = abs(sourcepoint['Value']-row['Value'])
    sourcecoord = sourcepoint['geometry']
    targetcoord = row['geometry']
    dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters

    new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
    new_row = pd.Series(new_row)
    new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
    return new_row

def calc_dist_df(df):
    full_df = pd.DataFrame()
    for i in df.index:
        tdf = df.loc[df.index>i]
        if tdf.empty == False:
            sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
            full_df = pd.concat([full_df, sliced_df])
    return full_df.reset_index(drop=True)

calc_dist_df(gdf)


### EXPECTED RESULT
  SourceName TargetName  Manhattan  SourceCoord  TargetCoord   Distance (m)
0          a          b          1  POINT (1 0)  POINT (1 2)  222605.296097
1          a          c          2  POINT (1 0)  POINT (1 0)       0.000000
2          a          d          3  POINT (1 0)  POINT (3 3)  400362.335920
3          b          c          1  POINT (1 2)  POINT (1 0)  222605.296097
4          b          d          2  POINT (1 2)  POINT (3 3)  247555.571681
5          c          d          1  POINT (1 0)  POINT (3 3)  400362.335920

It works good as expected, BUT it is extremely slow for big datasets. I am iterating once over each row of the dataframe to slice the gdf once and then I use .apply() on the sliced gdf, but I was wondering if there is a way to avoid the first for loop or maybe another solution to make this algorithm much faster.

NOTE
combination from itertools might not be the solution because the geometry column can contain repeated values
EDIT
This is the distribution of repeated values for the 'geometry' column. As you can see most of the points are repeated and only a few are unique. Distribution of PointID counts


Solution

  • You can use scipy.spatial.distance_matrix. Use .x and .y properties to extract coordinates from shapely Point:

    from scipy.spatial import distance_matrix
    
    RADIUS = 6371.009 * 1e3  # meters
    
    cx = gdf.add_prefix('Source').merge(gdf.add_prefix('Target'), how='cross')
    coords = np.radians(np.stack([gdf['geometry'].x, gdf['geometry'].y], axis=1))
    cx['Distance'] = distance_matrix(coords, coords, p=2).ravel() * RADIUS
    
    r, c = np.triu_indices(len(gdf), k=1)
    cx = cx.loc[r * len(df) + c]
    

    Output:

    >>> cx
       SourceName  SourceValue           Sourcegeometry TargetName  TargetValue           Targetgeometry       Distance
    1           a            1  POINT (1.00000 0.00000)          b            2  POINT (1.00000 2.00000)  222390.167448
    2           a            1  POINT (1.00000 0.00000)          c            3  POINT (1.00000 0.00000)       0.000000
    3           a            1  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947
    6           b            2  POINT (1.00000 2.00000)          c            3  POINT (1.00000 0.00000)  222390.167448
    7           b            2  POINT (1.00000 2.00000)          d            4  POINT (3.00000 3.00000)  248639.765971
    11          c            3  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947