Tags: python, pandas, latitude-longitude, nearest-neighbor, geopy

Find closest (lat/lon) observation from one pandas dataframe for each observation in a second dataframe


Problem summary:

I have two dataframes. The first (df1) is relatively small (nearly always fewer than 100 observations, typically fewer than 50) and holds a set of point identifiers with their lat/lon coordinates. The second (df2) is very large (hundreds of thousands of observations) and also has lat/lon coordinates. I want to create two new columns in df2: the first holds the identifier of the nearest point from df1, and the second holds the distance to that point. My current method is quite clunky and I think it can be optimized significantly. For additional context, there is a single df1 (small dataframe), but I will be repeating this process for multiple df2s (large dataframes).

Setup/Sample Data:

# imports:
import pandas as pd
import geopy.distance
from faker import Faker

# creating sample data:
Faker.seed(0)
fake=Faker()

id1=[]
lat1=[]
lon1=[]
id2=[]
lat2=[]
lon2=[]
length1=10 # length of df1
length2=100 # length of df2

for x in range(length1):
    a=fake.local_latlng()
    id1.append(x)
    lat1.append(float(a[0]))
    lon1.append(float(a[1]))
for x in range(length2):
    a=fake.local_latlng()
    id2.append(x)
    lat2.append(float(a[0]))
    lon2.append(float(a[1]))

dict1={
    'loc_id' : id1,
    'lat' : lat1,
    'lon' : lon1,
    }

dict2={
    'point_id' : id2,
    'lat' : lat2,
    'lon' : lon2,
    }

df1=pd.DataFrame(dict1)
df2=pd.DataFrame(dict2)

Current Solution:

# calculating distances:
for x in range(len(df1)):
    loc_id=df1.iloc[x]['loc_id']
    pt1=(df1.iloc[x]['lat'],df1.iloc[x]['lon'])
    for y in range(len(df2)):
        pt2=(df2.iloc[y]['lat'],df2.iloc[y]['lon'])
        dist=geopy.distance.distance(pt1,pt2).miles
        df2.loc[y,x]=dist

# determining minimum distance and label:
temp_cols=list(range(len(df1)))
df2['min_dist']=df2[temp_cols].min(axis=1)
df2['min_loc']=df2[temp_cols].idxmin(axis=1)

# removing extra columns:
df2=df2.drop(temp_cols,axis=1)
print(df2.head())

Possible solutions:

This code is quite obviously slow, since I compute the distance for each pair of points. Conceptually, I think this can be improved, but I'm having trouble implementing the improvements. Some ideas:

  1. Vectorize operations. This accepted answer seems to indicate that operations over vectors are faster, but I don't know how to apply the geopy.distance.distance() function over a vector (or whether that's even possible). A rough sketch of this idea follows the list.
  2. Eliminate points that are "dominated", so to speak: if, for example, one candidate point is larger in both lat and lon than another relative to the point I am checking, I might be able to eliminate it from the set I have to check against. I imagine this increases work up front but pays off by reducing the number of points I check for each observation, though the algorithm for doing this is not obvious to me.
  3. Bin the points into groups that are near one another, so that each observation only needs to be compared against a smaller set of candidates; maybe it is even possible to identify the closest point(s) before calculating any distances. The danger is that some of the points in df1 might be quite close together as well.
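
For idea 1, geopy.distance.distance (geodesic) does not operate on arrays, but a haversine approximation can be written with NumPy and broadcast over every df1/df2 pair at once. A rough sketch of what that could look like (the helper name and the 3958.8-mile Earth radius are my own; haversine treats the Earth as a sphere, so results will differ slightly from geopy's geodesic distances):

import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    # broadcast haversine distance; inputs in degrees, output in (approximate) miles
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * np.arcsin(np.sqrt(a))

# pairwise distance matrix: rows are df2 points, columns are df1 points
d = haversine_miles(df2['lat'].values[:, None], df2['lon'].values[:, None],
                    df1['lat'].values[None, :], df1['lon'].values[None, :])
df2['min_dist'] = d.min(axis=1)
df2['min_loc'] = df1['loc_id'].values[d.argmin(axis=1)]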

Additional details: The odds of two points having identical distances are small, and I am happy to randomly select any point that ties for closest if that should come up.


Solution

  • Based on k-nearest-neighbor search using a BallTree

    Approach

    1. Build a k-nearest-neighbor tree from the lat/lon coordinates of df1. Use BallTree, since it accepts custom distance functions such as geopy.distance.distance
    2. For each lat/lon in df2, query the tree from step 1 for its closest point
    3. Note: BallTree would be even faster with one of its built-in distance functions, such as haversine (a sketch of that variant follows the timing results)

    Code

    import pandas as pd
    import numpy as np
    from faker import Faker
    from sklearn.neighbors import BallTree
    from geopy import distance
    
    import functools
    import time
    
    # Timing Decorator
    def timer(func):
        """Print the runtime of the decorated function"""
        @functools.wraps(func)
        def wrapper_timer(*args, **kwargs):
            start_time = time.perf_counter()    # start timer
            value = func(*args, **kwargs)
            end_time = time.perf_counter()      # stop timer
            run_time = end_time - start_time    # elapsed seconds
            print(f"Finished {func.__name__!r} in {run_time:.4f} secs")
            return value
        return wrapper_timer
    
    def generate_data(length):
        '''
            Generate a data frame with random lat/lon data.
            Data of the same length is always identical since we use the same initial seed.
        '''
        Faker.seed(0)
        fake = Faker()
    
        ids = list(range(length))  # renamed from id to avoid shadowing the built-in
        # First two fields of fake.local_latlng are lat & lon as string
        # Generate vector of fake.local_latlng then unpack out lat/lon array
        lat, lon = list(zip(*[fake.local_latlng() for _ in range(length)]))[:2]
        
        # Convert strings to float
        lat = [float(x) for x in lat]
        lon = [float(x) for x in lon]
        
        return pd.DataFrame({'point_id': ids, 'lat': lat, 'lon': lon})
    
    def generate_balltree(df):
        '''
            Generate a BallTree using a custom distance metric (i.e. geodesic distance from geopy)
        '''
        return BallTree(df[['lat', 'lon']].values, metric=lambda u, v: distance.distance(u, v).miles)
    
    @timer
    def find_matches(tree, df):
        '''
            Find closest matches in df to items in tree
        '''
        distances, indices = tree.query(df[['lat', 'lon']].values, k = 1)
        df['min_dist'] = distances
        df['min_loc'] = indices
        
        return df
    
    @timer
    def find_min_op(df1, df2):
        ' OP solution (to compare timing) '
    
        for x in range(len(df1)):
            #loc_id=df1.iloc[x]['loc_id'] # not used
            pt1=(df1.iloc[x]['lat'],df1.iloc[x]['lon'])
            for y in range(len(df2)):
                pt2=(df2.iloc[y]['lat'],df2.iloc[y]['lon'])
                dist=distance.distance(pt1,pt2).miles
                df2.loc[y,x]=dist
    
        # determining minimum distance and label:
        temp_cols=list(range(len(df1)))
        df2['min_dist']=df2[temp_cols].min(axis=1)
        df2['min_loc']=df2[temp_cols].idxmin(axis=1)
    
        # removing extra columns:
        df2 = df2.drop(columns = temp_cols)
        
        return df2
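
    A couple of practical notes when adapting find_matches to the original df1/df2: tree.query returns positional indices into the array the BallTree was built from, not df1's loc_id values, and both returned arrays have shape (n, 1), which some pandas versions will not accept directly as a column. A hedged sketch of the extra mapping step (assuming df1 carries the loc_id column from the question):

    distances, indices = tree.query(df2[['lat', 'lon']].values, k=1)
    df2['min_dist'] = distances.ravel()                          # in miles, since the custom metric returns miles
    df2['min_loc'] = df1['loc_id'].to_numpy()[indices.ravel()]   # map tree positions back to loc_id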
    

    Tests

    df1 = 100 elements, df2 = 1000 elements

    l1, l2 = 100, 1000
    df1 = generate_data(l1)
    df2 = generate_data(l2)
    tree = generate_balltree(df1)
    find_matches(tree, df2)
    
    df2 = generate_data(l2)  # Regenerate df2 for next test since find_matches modified it
    find_min_op(df1, df2)
    

    Output

    Finished 'find_matches' in 32.1677 secs
    Finished 'find_min_op' in 147.7042 secs
    

    Thus, this approach is ~5x faster in this test.
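
    As noted in step 3 of the approach, BallTree also supports a built-in haversine metric, which avoids calling back into Python for every pairwise distance and should be faster still. Below is a minimal sketch of that variant (haversine assumes a spherical Earth, so distances will differ slightly from geopy's geodesic values; the 3958.8-mile Earth radius is an assumed constant). Since there is only one df1, either tree needs to be built just once and can then be queried against each large df2.

    import numpy as np
    from sklearn.neighbors import BallTree

    EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant)

    def generate_balltree_haversine(df):
        # the built-in haversine metric expects [lat, lon] in radians
        return BallTree(np.radians(df[['lat', 'lon']].values), metric='haversine')

    def find_matches_haversine(tree, df):
        # query returns great-circle distances in radians; scale to miles
        dist_rad, indices = tree.query(np.radians(df[['lat', 'lon']].values), k=1)
        df['min_dist'] = dist_rad.ravel() * EARTH_RADIUS_MILES
        df['min_loc'] = indices.ravel()
        return df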