I have a latitude and longitude stored in a pandas dataframe (df
) with filler spots as NaN
for stop_id, stoplat, stoplon
, and in another dataframe areadf
, which contains more lats/lons and an arbitrary id; this is the information that is to be populated into df
.
I'm trying to connect the two so that the stop columns in df
contain information about the stop closest to that lat/lon point, or leave it as NaN
if there is no stop within a radius R of the point.
Right now my code is as follows, but it takes a reaaaaallly long time (>40 minutes for what I'm running at the moment, before changing area to a df and using itertuples; not sure of what magnitude of difference this will make?) as there are thousands of lat/lon points and stops for each set of data, which is a problem because I need to run this on multiple files. I'm looking for suggestions to make it run faster. I've already made some very minor improvements (e.g. moving to a dataframe, using itertuples instead of iterrows, defining lats and lons outside of the loop to avoid having to retrieve it from df on every loop) but I'm out of ideas for speeding it up. getDistance
uses the Haversine formula as defined to get the distance between the stop sign and the given lat,lon point.
import pandas as pd
from math import cos, asin, sqrt
R=5
lats = df['lat']
lons = df['lon']
for stop in areadf.itertuples():
for index in df.index:
if getDistance(lats[index],lons[index],
stop[1],stop[2]) < R:
df.at[index,'stop_id'] = stop[0] # id
df.at[index,'stoplat'] = stop[1] # lat
df.at[index,'stoplon'] = stop[2] # lon
def getDistance(lat1,lon1,lat2,lon2):
p = 0.017453292519943295 #Pi/180
a = (0.5 - cos((lat2 - lat1) * p)/2 + cos(lat1 * p) *
cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
return 12742 * asin(sqrt(a)) * 100
Sample data:
df
lat lon stop_id stoplat stoplon
43.657676 -79.380146 NaN NaN NaN
43.694324 -79.334555 NaN NaN NaN
areadf
stop_id stoplat stoplon
0 43.657675 -79.380145
1 45.435143 -90.543253
Desired:
df
lat lon stop_id stoplat stoplon
43.657676 -79.380146 0 43.657675 -79.380145
43.694324 -79.334555 NaN NaN NaN
One way would be to use the numpy haversine function from here, just slightly modified so that you can account for the radius you want.
The just iterate through your df
with apply and find the closest value within a given radius
def haversine_np(lon1, lat1, lon2, lat2,R):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
All args must be of equal length.
"""
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
c = 2 * np.arcsin(np.sqrt(a))
km = 6367 * c
if km.min() <= R:
return km.argmin()
else:
return -1
df['dex'] = df[['lat','lon']].apply(lambda row: haversine_np(row[1],row[0],areadf.stoplon.values,areadf.stoplat.values,1),axis=1)
Then merge the two dataframes.
df.merge(areadf,how='left',left_on='dex',right_index=True).drop('dex',axis=1)
lat lon stop_id stoplat stoplon
0 43.657676 -79.380146 0.0 43.657675 -79.380145
1 43.694324 -79.334555 NaN NaN NaN
NOTE: If you choose to follow this method, you must be sure that both dataframes indexes are reset or that they are sequentially ordered from 0 to total len of df. So be sure to reset the indexes before you run this.
df.reset_index(drop=True,inplace=True)
areadf.reset_index(drop=True,inplace=True)