python · pandas · dataframe · vectorization

How to speed up a row-by-row minimum-distance calculation over 2 dataframes in pandas


I have 2 (simplified for presentation) DataFrames:

table1 (the table2_id column is the one to be filled in):

    table1_id  lat  long  table2_id
    1          5.5  45.5
    2          5.2  50.2
    3          8.9  49.7

table2:

    table2_id  lat  long
    1          5.0  47.2
    2          8.5  22.5
    3          2.1  33.3

Table1 has >40000 rows. Table2 has 3000 rows.

What I want is to find, for each row in table1, the table2_id whose location has the shortest distance to it, using latitude/longitude; for the distance calculation I am using geopy.distance.

The slow way to do this is to iterate over each coordinate in table1 and, for each of those, iterate over all rows of table2 to find the minimum. This is very slow using DataFrame.iterrows() or DataFrame.apply.

It looks something like this:

for idx, row in table1_df.iterrows():
    location1 = (row["lat"], row["long"])
    min_table2id = 0
    min_distance = float("inf")
    for idx2, row2 in table2_df.iterrows():
        location2 = (row2["lat"], row2["long"])
        distance = geopy.distance.geodesic(location1, location2).km
        if distance < min_distance:
            min_distance = distance
            min_table2id = row2["table2_id"]
    # assign via .at -- writing to the row returned by iterrows() would not
    # modify the underlying dataframe
    table1_df.at[idx, "table2_id"] = min_table2id

I've only done simple things on smaller datasets where speed was never a problem, but this runs for minutes, which was somewhat expected given the two nested loops over tables that large.

I am not too familiar with vectorization (only used it to manipulate single columns in a dataframe) and was wondering if there is a good way to vectorize this, or speed it up in another way. Thanks!


Solution

  • As far as I understand, the geodesic function does not support vectorization, so you need a distance function that works on coordinate vectors. Fortunately, there are many such implementations; the haversine formula below is a very simple one. Here is a complete example of the solution to your question, with two dataframes of the dimensions you provided.

    import pandas as pd
    import numpy as np
    
    df1 = pd.DataFrame({'df1_id':range(40_000), 'lat':np.random.uniform(-10, 10, size=40_000), 'lon':np.random.uniform(-10, 10, size=40_000)})
    df1['df2_id'] = np.nan
    
    df2 = pd.DataFrame({'df2_id':range(3000), 'lat':np.random.uniform(-10, 10, size=3000), 'lon':np.random.uniform(-10, 10, size=3000)})
    
    def haversine(lon1, lat1, lon2, lat2):
        lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
        dlon = lon2 - lon1
        dlat = lat2 - lat1
    
        haver_formula = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    
        r = 6371  # Earth radius in km; use 3958.756 for miles
        dist = 2 * r * np.arcsin(np.sqrt(haver_formula))
        return dist
    
    
    def foo(row, table2_df):
        # row is a Series labelled 'lat'/'lon'; index by label rather than
        # position, which is deprecated for non-integer indexes
        size = table2_df.shape[0]
        distances = haversine([row['lon']] * size, [row['lat']] * size,
                              table2_df.lon.values, table2_df.lat.values)
        return table2_df.iloc[np.argmin(distances), 0]
    
    %%timeit
    df1[['lat', 'lon']].apply(foo, axis=1, args=(df2,))
    
    Out:
         14 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
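The per-row apply can also be eliminated entirely by computing the full distance matrix with NumPy broadcasting, processing table1 in chunks to bound memory (a full 40 000 × 3 000 float64 matrix is already close to 1 GB per temporary). A sketch — `haversine_matrix` and `nearest_ids` are illustrative names, not part of the answer above:

```python
import numpy as np
import pandas as pd

def haversine_matrix(lon1, lat1, lon2, lat2):
    # Pairwise haversine distance in km: lon1/lat1 have shape (n, 1),
    # lon2/lat2 shape (m,); broadcasting yields an (n, m) matrix.
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

def nearest_ids(df1, df2, chunk=2000):
    # Process df1 in chunks so the (chunk, len(df2)) matrix stays small.
    lat2, lon2 = df2['lat'].to_numpy(), df2['lon'].to_numpy()
    id2 = df2['df2_id'].to_numpy()
    lat1_all, lon1_all = df1['lat'].to_numpy(), df1['lon'].to_numpy()
    out = np.empty(len(df1), dtype=id2.dtype)
    for start in range(0, len(df1), chunk):
        lat1 = lat1_all[start:start + chunk, None]  # column vector -> broadcasts
        lon1 = lon1_all[start:start + chunk, None]
        d = haversine_matrix(lon1, lat1, lon2, lat2)
        out[start:start + chunk] = id2[d.argmin(axis=1)]
    return out
```

With the dataframes above the usage would be `df1['df2_id'] = nearest_ids(df1, df2)`.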
    

    You can also parallelize the apply method using the parallel-pandas library. It's very simple and doesn't require you to rewrite your code.

    # pip install parallel-pandas
    from parallel_pandas import ParallelPandas
    ParallelPandas.initialize(disable_pr_bar=True)
    
    # p_apply is parallel analogue of apply method
    %%timeit
    df1[['lat', 'lon']].p_apply(foo, axis=1, args=(df2,))
    
    Out:
         1.79 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Thus, the total time drops from about 14 seconds to under two. I hope this is acceptable to you. Good luck!
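If scikit-learn is available, another option — not used in the answer above — is `sklearn.neighbors.BallTree` with its built-in haversine metric, which avoids computing all n × m distances. A sketch under that assumption (BallTree expects `[lat, lon]` pairs in radians and returns distances in radians):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df1 = pd.DataFrame({'df1_id': range(40_000),
                    'lat': np.random.uniform(-10, 10, size=40_000),
                    'lon': np.random.uniform(-10, 10, size=40_000)})
df2 = pd.DataFrame({'df2_id': range(3000),
                    'lat': np.random.uniform(-10, 10, size=3000),
                    'lon': np.random.uniform(-10, 10, size=3000)})

# Build the tree on table2 once, then query all of table1's nearest neighbors.
tree = BallTree(np.radians(df2[['lat', 'lon']].to_numpy()), metric='haversine')
dist, idx = tree.query(np.radians(df1[['lat', 'lon']].to_numpy()), k=1)

df1['df2_id'] = df2['df2_id'].to_numpy()[idx[:, 0]]
dist_km = dist[:, 0] * 6371  # distances come back in radians; scale by Earth radius
```

The tree query scales roughly as O(n log m) instead of O(n·m), which matters as table2 grows.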