I am working on a project for university, where I have two pandas dataframes:
# Libraries
import pandas as pd
from geopy import distance
# Dataframes
df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})
I need to calculate the distance between every pair of (latitude, longitude) coordinates across the two dataframes, so I used geopy. If any point in df2 is less than a threshold of 100 meters away from a point in df1, I must assign the value 1 to the 'nearby' column of that df1 row. I wrote the following code:
threshold = 100  # meters
df1['nearby'] = 0
for i in range(0, len(df1)):
    for j in range(0, len(df2)):
        coord_geo_1 = (df1['lat'].iloc[i], df1['long'].iloc[i])
        coord_geo_2 = (df2['lat'].iloc[j], df2['long'].iloc[j])
        var_distance = (distance.distance(coord_geo_1, coord_geo_2).km) * 1000
        if(var_distance < threshold):
            df1['nearby'].iloc[i] = 1
Although a warning appears, the code works. However, I would like to find a way to avoid the for loops. Is that possible?
# Output:
id     lat    long  nearby
 1  -23.48  -46.36       0
 2  -22.94  -45.40       0
 3  -23.22  -45.80       1
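If you can use scikit-learn, its haversine_distances function computes the pairwise distances between two sets of coordinates. It expects [lat, long] in radians and returns great-circle distances on a unit sphere, so the result is multiplied by the Earth's radius to get meters. So you get: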
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# threshold in meters; change as needed
threshold = 100
# mean Earth radius in meters
earth_radius = 6371000

df1['nearby'] = (
    # pairwise distances between every point of df1 and every point of df2
    haversine_distances(
        # inputs must be [lat, long] in radians, hence np.radians
        X=np.radians(df1[['lat', 'long']].to_numpy()),
        Y=np.radians(df2[['lat', 'long']].to_numpy()))
    # convert from radians on the unit sphere to meters
    * earth_radius
    # compare to the threshold
    < threshold
    # flag a df1 row if any point of df2 is within the threshold
).any(axis=1).astype(int)

print(df1)
# id lat long nearby
# 0 1 -23.48 -46.36 0
# 1 2 -22.94 -45.40 0
# 2 3 -23.22 -45.80 1
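If scikit-learn is not available, the same haversine formula can be written directly with NumPy broadcasting. This is only a minimal sketch, not part of the original answer: the helper name nearby_flags and the constant EARTH_RADIUS_M are made up for illustration, and it assumes a spherical Earth with the same radius as above.

```python
import numpy as np

EARTH_RADIUS_M = 6371000  # assumed mean Earth radius in meters


def nearby_flags(lat1, lon1, lat2, lon2, threshold_m=100):
    """Hypothetical helper: 1 where a point of set 1 has any point of set 2 within threshold_m."""
    # Column vectors for set 1 and row vectors for set 2, so that
    # broadcasting yields a (len(set 1), len(set 2)) matrix.
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]

    # Haversine formula for the great-circle distance
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_m = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

    # A row is flagged if any column is under the threshold
    return (dist_m < threshold_m).any(axis=1).astype(int)


# Same coordinates as df1 and df2 in the question
flags = nearby_flags([-23.48, -22.94, -23.22], [-46.36, -45.40, -45.80],
                     [-28.48, -22.94, -23.22], [-46.36, -46.40, -45.80])
print(flags)  # [0 0 1]
```

With the dataframes from the question, the call would be nearby_flags(df1['lat'], df1['long'], df2['lat'], df2['long']), and the result can be assigned to df1['nearby'].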
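EDIT: The OP asked for a version using distance from geopy, so here is one way. Note that it still loops in Python through list comprehensions, because geopy computes one pair at a time.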
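df1['nearby'] = (
    np.array(
        # one geopy distance (in km) per (df1 point, df2 point) pair
        [[distance.distance(coord1, coord2).km
          for coord2 in df2[['lat', 'long']].to_numpy()]
         for coord1 in df1[['lat', 'long']].to_numpy()]
    ) * 1000 < threshold  # km to meters, then compare to the threshold
).any(axis=1).astype(int)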
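Both versions above build the full len(df1) x len(df2) distance matrix, which can get heavy for large dataframes. One possible alternative, not part of the original answer, is a scikit-learn BallTree with the haversine metric, which only counts the neighbours inside the radius. A minimal sketch under the same spherical-Earth assumption (note that query_radius treats the radius as inclusive, so a point at exactly 100 meters would be flagged, unlike the strict < above):

```python
import numpy as np
from sklearn.neighbors import BallTree

threshold = 100          # meters
earth_radius = 6371000   # assumed mean Earth radius in meters

# Same coordinates as df1 and df2 in the question, as [lat, long] in radians
pts1 = np.radians([[-23.48, -46.36], [-22.94, -45.40], [-23.22, -45.80]])
pts2 = np.radians([[-28.48, -46.36], [-22.94, -46.40], [-23.22, -45.80]])

# Index the points of df2 once
tree = BallTree(pts2, metric='haversine')

# For each point of df1, count the df2 points within the threshold;
# the radius must be expressed in radians on the unit sphere
counts = tree.query_radius(pts1, r=threshold / earth_radius, count_only=True)

nearby = (counts > 0).astype(int)
print(nearby)  # [0 0 1]
```

With the dataframes, pts1 and pts2 would come from np.radians(df1[['lat', 'long']].to_numpy()) and the df2 equivalent, and nearby can be assigned to df1['nearby'].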