Speeding up a nested for loop through two Pandas DataFrames


I have latitude/longitude points stored in a pandas DataFrame (df), with NaN placeholders in the stop_id, stoplat, and stoplon columns. A second DataFrame, areadf, contains more lats/lons and an arbitrary id; this is the information that is to be populated into df.

I'm trying to connect the two so that the stop columns in df contain information about the stop closest to that lat/lon point, or leave it as NaN if there is no stop within a radius R of the point.

Right now my code is as follows, but it takes a really long time: more than 40 minutes for what I'm running at the moment. (That was before changing area to a DataFrame and using itertuples; I'm not sure how much difference that will make.) There are thousands of lat/lon points and stops in each set of data, and I need to run this on multiple files, so I'm looking for suggestions to make it run faster. I've already made some very minor improvements (moving to a DataFrame, using itertuples instead of iterrows, and defining lats and lons outside the loop to avoid retrieving them from df on every iteration), but I'm out of ideas for speeding it up. getDistance uses the Haversine formula to get the distance between the stop sign and the given lat/lon point.

import pandas as pd
from math import cos, asin, sqrt

R=5
lats = df['lat']
lons = df['lon']
for stop in areadf.itertuples():
    for index in df.index:
        if getDistance(lats[index],lons[index],
                       stop[1],stop[2]) < R:
            df.at[index,'stop_id'] = stop[0] # id
            df.at[index,'stoplat'] = stop[1] # lat
            df.at[index,'stoplon'] = stop[2] # lon

def getDistance(lat1,lon1,lat2,lon2):
    p = 0.017453292519943295     #Pi/180
    a = (0.5 - cos((lat2 - lat1) * p)/2 + cos(lat1 * p) * 
         cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a)) * 100
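
For reference, the expression in getDistance is the haversine formula written with cosines. Applying the identity (1 - cos x)/2 = sin²(x/2) to both cosine terms gives the more familiar form below, where φ is latitude and λ is longitude (both in radians) and 2r = 12742 km is the Earth's diameter used in the code:

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
  + \cos\varphi_1 \cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2r \arcsin\!\left(\sqrt{a}\right)
```

Note that the return statement also multiplies d by 100, so as written the result is 100 times the distance in kilometres, and R = 5 would correspond to 0.05 km. Whether that scaling is intended is not stated here.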

Sample data:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  NaN        NaN        NaN
43.694324  -79.334555  NaN        NaN        NaN

areadf
stop_id    stoplat    stoplon
0          43.657675  -79.380145
1          45.435143  -90.543253

Desired:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  0          43.657675  -79.380145
43.694324  -79.334555  NaN        NaN        NaN

Solution

  • One way would be to use the NumPy haversine function from here, slightly modified to take the radius you want into account.

    Then just iterate through your df with apply and find the closest stop within the given radius.

    import numpy as np

    def haversine_np(lon1, lat1, lon2, lat2, R):
        """
        Calculate the great circle distance in km between one point and
        an array of points on the earth (specified in decimal degrees).
        Return the position of the closest point if it lies within R km,
        otherwise -1.
        """
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c
        if km.min() <= R:
            return km.argmin()
        else:
            return -1

    # The last argument is the radius in km (1 here); pass your own R instead.
    # Label-based access (row['lon']) avoids relying on positional row[1]/row[0].
    df['dex'] = df[['lat','lon']].apply(lambda row: haversine_np(row['lon'],row['lat'],areadf.stoplon.values,areadf.stoplat.values,1),axis=1)
    

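    Then merge the two dataframes. If df already contains the NaN placeholder columns (stop_id, stoplat, stoplon), drop them first; otherwise the merge produces suffixed duplicates such as stop_id_x and stop_id_y.

    df.drop(columns=['stop_id','stoplat','stoplon'],errors='ignore').merge(areadf,how='left',left_on='dex',right_index=True).drop('dex',axis=1)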
    
             lat        lon  stop_id    stoplat    stoplon
    0  43.657676 -79.380146      0.0  43.657675 -79.380145
    1  43.694324 -79.334555      NaN        NaN        NaN
    

    NOTE: If you follow this method, make sure that both dataframes' indexes are reset, i.e. numbered sequentially from 0 to len - 1. argmin returns a position, and the merge matches that position against areadf's index. So be sure to reset the indexes before you run this.

    df.reset_index(drop=True,inplace=True)
    areadf.reset_index(drop=True,inplace=True)
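
A possible further step, not part of the answer above: apply still calls the function once per row of df. If the full points-by-stops distance matrix fits in memory (a few thousand by a few thousand should), NumPy broadcasting can compute it in one pass. The sketch below makes several assumptions: the function name nearest_stop_within is hypothetical, the stop columns are numeric as in the sample data, areadf is not empty, and the radius is in kilometres with the same 6367 km Earth radius as haversine_np.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6367  # same value as in haversine_np above


def nearest_stop_within(df, areadf, radius_km):
    """Return lat/lon from df plus the nearest stop within radius_km (else NaN).

    Hypothetical helper: assumes numeric stop columns and a non-empty areadf.
    """
    # Points become column vectors and stops row vectors, so broadcasting
    # yields a (len(df), len(areadf)) matrix of pairwise distances.
    lat1 = np.radians(df["lat"].to_numpy())[:, None]
    lon1 = np.radians(df["lon"].to_numpy())[:, None]
    lat2 = np.radians(areadf["stoplat"].to_numpy())[None, :]
    lon2 = np.radians(areadf["stoplon"].to_numpy())[None, :]

    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    nearest = km.argmin(axis=1)                  # position of the closest stop per point
    within = km[np.arange(len(df)), nearest] <= radius_km

    out = df[["lat", "lon"]].copy()
    for col in ["stop_id", "stoplat", "stoplon"]:
        values = areadf[col].to_numpy()[nearest].astype(float)
        out[col] = np.where(within, values, np.nan)  # NaN when no stop is close enough
    return out


# Sample data from the question
df = pd.DataFrame({"lat": [43.657676, 43.694324],
                   "lon": [-79.380146, -79.334555]})
areadf = pd.DataFrame({"stop_id": [0, 1],
                       "stoplat": [43.657675, 45.435143],
                       "stoplon": [-79.380145, -90.543253]})

result = nearest_stop_within(df, areadf, radius_km=1)
print(result)
```

Because stops are looked up by position rather than by index label, this version does not depend on either index being reset. For inputs too large for a single matrix, the same function could be called on slices of df and the pieces concatenated.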