I have two DataFrames (simplified for presentation):
table1_id | lat | long | table2_id |
---|---|---|---|
1 | 5.5 | 45.5 | |
2 | 5.2 | 50.2 | |
3 | 8.9 | 49.7 | |
table2_id | lat | long |
---|---|---|
1 | 5.0 | 47.2 |
2 | 8.5 | 22.5 |
3 | 2.1 | 33.3 |
Table1 has >40000 rows. Table2 has 3000 rows.
For each item in table1, I want to find the table2_id of the row in table2 that is closest to its location based on latitude/longitude; for the distance calculation I am using geopy.distance.
The slow way to do this is to iterate over each coordinate in table1 and, for each of those, iterate over all rows of table2 to find the minimum, which is very slow using DataFrame.iterrows() or DataFrame.apply.
It would look somewhat like this:
import geopy.distance

for idx, row in table1_df.iterrows():
    location1 = (row["lat"], row["long"])
    min_table2id = 0
    min_distance = float("inf")
    for idx2, row2 in table2_df.iterrows():
        location2 = (row2["lat"], row2["long"])
        distance = geopy.distance.geodesic(location1, location2).km
        if distance < min_distance:
            min_distance = distance
            min_table2id = row2["table2_id"]
    # write back into the DataFrame; assigning to the iterrows() row copy has no effect
    table1_df.at[idx, "table2_id"] = min_table2id
I've only done simple things over smaller datasets where speed was never a problem, but this runs for minutes, which was somewhat expected given those two for loops over tables this large.
I am not too familiar with vectorization (I've only used it to manipulate single columns in a DataFrame) and was wondering if there is a good way to vectorize this, or to speed it up in another way. Thanks!
As far as I understand, the geodesic function does not support vectorization, so you need to implement a distance calculation that works on coordinate vectors yourself. Fortunately, there are many such implementations.
Here is a very simple one, together with a complete example of the solution to your question. I have created two DataFrames with the dimensions you provided.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'df1_id':range(40_000), 'lat':np.random.uniform(-10, 10, size=40_000), 'lon':np.random.uniform(-10, 10, size=40_000)})
df1['df2_id'] = np.nan
df2 = pd.DataFrame({'df2_id':range(3000), 'lat':np.random.uniform(-10, 10, size=3000), 'lon':np.random.uniform(-10, 10, size=3000)})
def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = np.radians([lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    haver_formula = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    r = 6371  # Earth radius in km; use 3958.756 for distance in miles
    dist = 2 * r * np.arcsin(np.sqrt(haver_formula))
    return dist
def foo(row, table2_df):
    size = table2_df.shape[0]
    # row[0] is lat, row[1] is lon; compare one table1 point against all of table2 at once
    distances = haversine([row[1]] * size, [row[0]] * size, table2_df.lon.values, table2_df.lat.values)
    return table2_df.iloc[np.argmin(distances), 0]
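As a quick check (this snippet is my addition, not part of the timed solution), haversine accepts NumPy arrays, so a single call computes the distance from one point to every row of df2, which is exactly what foo does for each table1 row:

# distance from one point (lat=45.5, lon=5.5, taken from the question) to all df2 rows
d = haversine(np.full(len(df2), 5.5), np.full(len(df2), 45.5),
              df2.lon.values, df2.lat.values)
print(d.shape)                        # (3000,)
print(df2.df2_id.iloc[np.argmin(d)])  # df2_id of the nearest point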
%%timeit
df1[['lat', 'lon']].apply(foo, axis=1, args=(df2,))
Out:
14 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
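Note that %%timeit only measures the call and discards its result, so to actually fill in the nearest df2_id for every row you still need to assign the output, e.g.:

# same computation as the timed cell, but storing the result in df1
df1['df2_id'] = df1[['lat', 'lon']].apply(foo, axis=1, args=(df2,))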
You can also parallelize the apply method using the parallel-pandas library. It's very simple and doesn't require you to rewrite your code.
# pip install parallel-pandas
from parallel_pandas import ParallelPandas
ParallelPandas.initialize(disable_pr_bar=True)
# p_apply is parallel analogue of apply method
%%timeit
df1[['lat', 'lon']].p_apply(foo, axis=1, args=(df2,))
Out:
1.79 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thus, the total time comes down to under two seconds. I hope this is acceptable to you. Good luck!