python · pandas · dataframe · vectorization

How to speed up a row-by-row minimum-distance calculation over 2 dataframes in pandas


I have 2 (simplified for presentation) DataFrames:

table1 (the table2_id column is the one to be filled in):

    table1_id  lat  long  table2_id
    1          5.5  45.5
    2          5.2  50.2
    3          8.9  49.7

table2:

    table2_id  lat  long
    1          5.0  47.2
    2          8.5  22.5
    3          2.1  33.3

Table1 has >40000 rows. Table2 has 3000 rows.

What I want is to find, for each row in table1, the table2_id whose location has the shortest distance to it, using latitude/longitude; for the distance calculation I am using geopy.distance.

The slow way to do this is to iterate over each coordinate in table1 and, for each of those, iterate over all rows of table2 to find the minimum. This is very slow using DataFrame.iterrows() or DataFrame.apply.

It looks something like this:

for idx, row in table1_df.iterrows():
    location1 = (row["lat"], row["long"])
    min_table2id = 0
    min_distance = float("inf")
    for idx2, row2 in table2_df.iterrows():
        location2 = (row2["lat"], row2["long"])
        distance = geopy.distance.geodesic(location1, location2).km
        if distance < min_distance:
            min_distance = distance
            min_table2id = row2["table2_id"]
    # assign via .at -- writing to the row returned by iterrows() would not
    # modify the underlying dataframe
    table1_df.at[idx, "table2_id"] = min_table2id

I've only done simple things on smaller datasets where speed was never a problem, but this runs for minutes, which was somewhat expected given the two nested loops over tables that large.

I am not too familiar with vectorization (only used it to manipulate single columns in a dataframe) and was wondering if there is a good way to vectorize this, or speed it up in another way. Thanks!


Solution

  • As far as I understand, the geodesic function does not support vectorization, so you need a distance function that works on coordinate vectors. Fortunately, there are many such implementations; the haversine formula below is a very simple one. Here is a complete example of the solution to your question, with two dataframes of the dimensions you provided.

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'df1_id':range(40_000), 'lat':np.random.uniform(-10, 10, size=40_000), 'lon':np.random.uniform(-10, 10, size=40_000)})
    df1['df2_id'] = np.nan
    
    df2 = pd.DataFrame({'df2_id':range(3000), 'lat':np.random.uniform(-10, 10, size=3000), 'lon':np.random.uniform(-10, 10, size=3000)})
    
    def haversine(lon1, lat1, lon2, lat2):
        lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
    
        haver_formula = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    
        r = 6371  # Earth radius in km; use 3958.756 for miles
        dist = 2 * r * np.arcsin(np.sqrt(haver_formula))
        return dist
    
    
    def foo(row, table2_df):
        # row is a Series labelled 'lat'/'lon'; index by label rather than
        # position, which is deprecated for non-integer indexes
        size = table2_df.shape[0]
        distances = haversine([row['lon']] * size, [row['lat']] * size,
                              table2_df.lon.values, table2_df.lat.values)
        return table2_df.iloc[np.argmin(distances), 0]
    
    %%timeit
    df1[['lat', 'lon']].apply(foo, axis=1, args=(df2,))
    
    Out:
         14 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
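The per-row apply can also be eliminated entirely by computing the full distance matrix with NumPy broadcasting, processing table1 in chunks to bound memory (a full 40 000 × 3 000 float64 matrix is already close to 1 GB per temporary). A sketch — `haversine_matrix` and `nearest_ids` are illustrative names, not part of the answer above:

```python
import numpy as np
import pandas as pd

def haversine_matrix(lon1, lat1, lon2, lat2):
    # Pairwise haversine distance in km: lon1/lat1 have shape (n, 1),
    # lon2/lat2 shape (m,); broadcasting yields an (n, m) matrix.
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

def nearest_ids(df1, df2, chunk=2000):
    # Process df1 in chunks so the (chunk, len(df2)) matrix stays small.
    lat2, lon2 = df2['lat'].to_numpy(), df2['lon'].to_numpy()
    id2 = df2['df2_id'].to_numpy()
    lat1_all, lon1_all = df1['lat'].to_numpy(), df1['lon'].to_numpy()
    out = np.empty(len(df1), dtype=id2.dtype)
    for start in range(0, len(df1), chunk):
        lat1 = lat1_all[start:start + chunk, None]  # column vector -> broadcasts
        lon1 = lon1_all[start:start + chunk, None]
        d = haversine_matrix(lon1, lat1, lon2, lat2)
        out[start:start + chunk] = id2[d.argmin(axis=1)]
    return out
```

With the dataframes above the usage would be `df1['df2_id'] = nearest_ids(df1, df2)`.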
    

    You can also parallelize the apply method using the parallel-pandas library. It's very simple and doesn't require you to rewrite your code.

    # pip install parallel-pandas
    from parallel_pandas import ParallelPandas
    ParallelPandas.initialize(disable_pr_bar=True)
    
    # p_apply is parallel analogue of apply method
    %%timeit
    df1[['lat', 'lon']].p_apply(foo, axis=1, args=(df2,))
    
    Out:
         1.79 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Thus, the total time drops from about 14 seconds to under two. I hope this is acceptable to you. Good luck!
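If scikit-learn is available, another option — not used in the answer above — is `sklearn.neighbors.BallTree` with its built-in haversine metric, which avoids computing all n × m distances. A sketch under that assumption (BallTree expects `[lat, lon]` pairs in radians and returns distances in radians):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df1 = pd.DataFrame({'df1_id': range(40_000),
                    'lat': np.random.uniform(-10, 10, size=40_000),
                    'lon': np.random.uniform(-10, 10, size=40_000)})
df2 = pd.DataFrame({'df2_id': range(3000),
                    'lat': np.random.uniform(-10, 10, size=3000),
                    'lon': np.random.uniform(-10, 10, size=3000)})

# Build the tree on table2 once, then query all of table1's nearest neighbors.
tree = BallTree(np.radians(df2[['lat', 'lon']].to_numpy()), metric='haversine')
dist, idx = tree.query(np.radians(df1[['lat', 'lon']].to_numpy()), k=1)

df1['df2_id'] = df2['df2_id'].to_numpy()[idx[:, 0]]
dist_km = dist[:, 0] * 6371  # distances come back in radians; scale by Earth radius
```

The tree query scales roughly as O(n log m) instead of O(n·m), which matters as table2 grows.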