I've been going around in circles and I can't figure out the right way to do this.
Ok, so I have several dataframes. Each one has coordinates in columns df['LATITUDE'] and df['LONGITUDE']. For each row, I want to count how many other rows have coordinates within n miles of it. I created a few new columns to store the counts, but actually filling them has me stumped. I've read that it's not good to iterate over rows in a dataframe, but I can't figure out a better way to do this. What I have seems to work, but it takes hours for each dataframe.
import numpy as np
import pandas as pd
from geopy.distance import geodesic

# example input
df = pd.DataFrame({'LATITUDE': [38.9547, 38.9404, 38.9032, 38.864, 38.8639, 38.9017, 38.947783, 38.8629, 38.9478, 38.9017, 38.9491, 38.8643, 38.8643, 38.9464, 38.903],
                   'LONGITUDE': [-77.0463, -77.048, -77.1417, -77.156, -77.1631, -77.131, -77.03983, -77.1558, -77.0439, -77.131, -77.0461, -77.1539, -77.1539, -77.0385, -77.1411]})

df['1_mile'] = np.nan
df['5_miles'] = np.nan
df['10_miles'] = np.nan

for n in range(len(df)):
    # geodesic distance from row n to every row (0 for the row itself)
    dists = df.apply(lambda row: geodesic((df['LATITUDE'].iloc[n], df['LONGITUDE'].iloc[n]),
                                          (row['LATITUDE'], row['LONGITUDE'])).miles, axis=1)
    df.at[n, '1_mile'] = len(dists[(dists <= 1) & (dists > 0)])
    df.at[n, '5_miles'] = len(dists[(dists <= 5) & (dists > 0)])
    df.at[n, '10_miles'] = len(dists[(dists <= 10) & (dists > 0)])
#desired output
LATITUDE LONGITUDE 1_mile 5_miles 10_miles
0 38.954700 -77.04630 5.0 5.0 14.0
1 38.940400 -77.04800 5.0 5.0 14.0
2 38.903200 -77.14170 3.0 8.0 14.0
3 38.864000 -77.15600 4.0 8.0 14.0
4 38.863900 -77.16310 4.0 8.0 14.0
5 38.901700 -77.13100 2.0 7.0 13.0
6 38.947783 -77.03983 5.0 5.0 14.0
7 38.862900 -77.15580 4.0 8.0 14.0
8 38.947800 -77.04390 5.0 5.0 14.0
9 38.901700 -77.13100 2.0 7.0 13.0
10 38.949100 -77.04610 5.0 5.0 14.0
11 38.864300 -77.15390 3.0 7.0 13.0
12 38.864300 -77.15390 3.0 7.0 13.0
13 38.946400 -77.03850 5.0 5.0 14.0
14 38.903000 -77.14110 3.0 8.0 14.0
It works but I know it's not the best way to do this. Can anyone please help me clean it up?
Thank you!
It's been a year without an answer, so I figured I should post a solution for anyone with a similar dilemma. Kudos to @Wightboy for suggesting the self join.
Through trial and error, I've learned that it's immensely faster to pass whole columns of coordinates through a vectorized function than to iterate through each row.
import pandas as pd
import numpy as np
def haversine(lon1, lat1, lon2, lat2):
    # vectorized haversine; works on scalars or whole pandas columns at once
    lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    haver_formula = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    r = 3958.756  # Earth's radius in miles (use 6371 for kilometres)
    dist = 2 * r * np.arcsin(np.sqrt(haver_formula))
    return pd.Series(dist)
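A quick sanity check (my own addition, value computed by me, not from the original post): the first two sample points should come out just under a mile apart, consistent with each landing in the other's 1-mile count below.

print(haversine(-77.0463, 38.9547, -77.048, 38.9404).iloc[0])  # ~0.99 miles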
df = pd.DataFrame({'ID': list(range(15)),
                   'LATITUDE': [38.9547, 38.9404, 38.9032, 38.864, 38.8639, 38.9017, 38.947783, 38.8629, 38.9478, 38.9017, 38.9491, 38.8643, 38.8643, 38.9464, 38.903],
                   'LONGITUDE': [-77.0463, -77.048, -77.1417, -77.156, -77.1631, -77.131, -77.03983, -77.1558, -77.0439, -77.131, -77.0461, -77.1539, -77.1539, -77.0385, -77.1411]})
# self join: pair every point with every other point
df2 = pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=('', '_2')).drop('key', axis=1)
# drop self-pairs (each point matched with itself)
df2 = df2[df2['ID'] != df2['ID_2']].reset_index(drop=True)
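# note (pandas >= 1.2): the same pairing can also be written with the built-in
# cross join, e.g. df2 = pd.merge(df, df, how='cross', suffixes=('', '_2'))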
#find distances
df2['dist'] = haversine(df2['LONGITUDE'], df2['LATITUDE'], df2['LONGITUDE_2'], df2['LATITUDE_2'])
# count, for each ID, how many other points fall within each radius
for m in [1, 5, 10]:
    counts = (df2.loc[df2['dist'] <= m, ['ID', 'dist']]
                 .groupby('ID').count()
                 .reset_index()
                 .rename(columns={'dist': str(m) + '_mile'}))
    df = pd.merge(df, counts, how='left')
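# any ID with no neighbours inside a radius comes back NaN from the left join;
# a df.fillna(0) afterwards would turn those into zero counts if needed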
#output
ID LATITUDE LONGITUDE 1_mile 5_mile 10_mile
0 0 38.954700 -77.04630 5 5 14
1 1 38.940400 -77.04800 5 5 14
2 2 38.903200 -77.14170 3 8 14
3 3 38.864000 -77.15600 4 8 14
4 4 38.863900 -77.16310 4 8 14
5 5 38.901700 -77.13100 3 8 14
6 6 38.947783 -77.03983 5 5 14
7 7 38.862900 -77.15580 4 8 14
8 8 38.947800 -77.04390 5 5 14
9 9 38.901700 -77.13100 3 8 14
10 10 38.949100 -77.04610 5 5 14
11 11 38.864300 -77.15390 4 8 14
12 12 38.864300 -77.15390 4 8 14
13 13 38.946400 -77.03850 5 5 14
14 14 38.903000 -77.14110 3 8 14
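For what it's worth, the self-join dataframe itself grows as n² rows, which can get heavy on large frames. Here's a sketch of the same counts computed straight from a NumPy broadcast distance matrix instead (my own variant, function name hypothetical; like the self-join above, it counts rows with exactly duplicated coordinates as neighbours):

import numpy as np
import pandas as pd

def neighbour_counts(df, radii=(1, 5, 10), r_earth=3958.756):
    # pairwise haversine distances via broadcasting: one (n, n) matrix, no merge
    lat = np.radians(df['LATITUDE'].to_numpy())
    lon = np.radians(df['LONGITUDE'].to_numpy())
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    h = np.sin(dlat / 2)**2 + np.cos(lat[:, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2)**2
    dist = 2 * r_earth * np.arcsin(np.sqrt(h))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbour
    out = df.copy()
    for m in radii:
        out[str(m) + '_mile'] = (dist <= m).sum(axis=1)
    return out

print(neighbour_counts(df))

It's still O(n²) in memory, but it skips the merge/groupby machinery entirely.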