I am trying to calculate some kind of similarity between points with coordinates in geographical space. I will use an example to make things a bit more clear:
import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point
df = pd.DataFrame({
'Name':['a','b','c','d'],
'Value':[1,2,3,4],
'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
Name Value geometry
0 a 1 POINT (1.00000 0.00000)
1 b 2 POINT (1.00000 2.00000)
2 c 3 POINT (1.00000 0.00000)
3 d 4 POINT (3.00000 3.00000)
I need a new dataframe containing the distance between each pair of points, their similarity (Manhattan distance in this case) and also their other possible variables (in this case there is only name
as additional variable).
My solution is the following:
def calc_values_for_row(row, sourcepoint): ## sourcepoint is a row of tdf
sourcename = sourcepoint['Name']
targetname = row['Name']
manhattan = abs(sourcepoint['Value']-row['Value'])
sourcecoord = sourcepoint['geometry']
targetcoord = row['geometry']
dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters
new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
new_row = pd.Series(new_row)
new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
return new_row
def calc_dist_df(df):
full_df = pd.DataFrame()
for i in df.index:
tdf = df.loc[df.index>i]
if tdf.empty == False:
sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
full_df = pd.concat([full_df, sliced_df])
return full_df.reset_index(drop=True)
calc_dist_df(gdf)
### EXPECTED RESULT
SourceName TargetName Manhattan SourceCoord TargetCoord Distance (m)
0 a b 1 POINT (1 0) POINT (1 2) 222605.296097
1 a c 2 POINT (1 0) POINT (1 0) 0.000000
2 a d 3 POINT (1 0) POINT (3 3) 400362.335920
3 b c 1 POINT (1 2) POINT (1 0) 222605.296097
4 b d 2 POINT (1 2) POINT (3 3) 247555.571681
5 c d 1 POINT (1 0) POINT (3 3) 400362.335920
It works good as expected, BUT it is extremely slow for big datasets.
I am iterating once over each row of the dataframe to slice the gdf once and then I use .apply()
on the sliced gdf, but I was wondering if there is a way to avoid the first for
loop or maybe another solution to make this algorithm much faster.
NOTE
combination
from itertools might not be the solution because the geometry column can contain repeated values
EDIT
This is the distribution of repeated values for the 'geometry' column. As you can see most of the points are repeated and only a few are unique.
You can use scipy.spatial.distance_matrix
. Use .x
and .y
properties to extract coordinates from shapely Point:
from scipy.spatial import distance_matrix
RADIUS = 6371.009 * 1e3 # meters
cx = gdf.add_prefix('Source').merge(gdf.add_prefix('Target'), how='cross')
coords = np.radians(np.stack([gdf['geometry'].x, gdf['geometry'].y], axis=1))
cx['Distance'] = distance_matrix(coords, coords, p=2).ravel() * RADIUS
r, c = np.triu_indices(len(gdf), k=1)
cx = cx.loc[r * len(df) + c]
Output:
>>> cx
SourceName SourceValue Sourcegeometry TargetName TargetValue Targetgeometry Distance
1 a 1 POINT (1.00000 0.00000) b 2 POINT (1.00000 2.00000) 222390.167448
2 a 1 POINT (1.00000 0.00000) c 3 POINT (1.00000 0.00000) 0.000000
3 a 1 POINT (1.00000 0.00000) d 4 POINT (3.00000 3.00000) 400919.575947
6 b 2 POINT (1.00000 2.00000) c 3 POINT (1.00000 0.00000) 222390.167448
7 b 2 POINT (1.00000 2.00000) d 4 POINT (3.00000 3.00000) 248639.765971
11 c 3 POINT (1.00000 0.00000) d 4 POINT (3.00000 3.00000) 400919.575947