I have two tables - Dealers and Customers. For each of the customer location from the customer table, I need to find the closest located dealer from the dealer table.
I have a code that works but takes a few hours to run. I need help in optimizing my solution.
The dealer table has 25k+ rows and customer table has 200k+ rows. Both tables have 3 main columns: (DealerID, Lat, Long) and (CustomerID, Lat, Long). My output looks something like this:
CustomerID | Lat | Long | ClosestDealer | Distance |
---|---|---|---|---|
Customer1 | 61.61 | -149.58 | Dealer3 | 15.53 |
Customer2 | 42.37 | -72.52 | Dealer258 | 8.02 |
Customer3 | 42.42 | -72.1 | Dealer1076 | 32.92 |
Customer4 | 31.59 | -89.87 | Dealer32 | 3.85 |
Customer5 | 36.75 | -94.84 | Dealer726 | 7.90 |
My current Solution: Iterating through all the rows of data to find the min. distance will take too long. To optimize this, I have grouped the data in both tables based on the rounded down version of lat and long points and then added them together to arrive at my final group (see 'LatLongGroup' column below).
CustomerID | Lat | Long | LatGroup | LongGroup | LatLongGroup |
---|---|---|---|---|---|
Customer1 | 61.61 | -149.58 | 61 | -149 | -88 |
Customer2 | 42.37 | -72.52 | 42 | -72 | -30 |
Customer3 | 42.42 | -72.1 | 42 | -72 | -30 |
Customer4 | 31.59 | -89.87 | 31 | -89 | -58 |
Customer5 | 36.75 | -94.84 | 36 | -94 | -58 |
Both these tables are sorted based on the 'LatLongGroup' column. And I have a separate table called group which provides the starting and ending row number of each group for the dealer table. I then match the records in the dealer table which have the same 'Latlonggroup' as that of the customerID. This helps me narrow down the search for the closest dealer.
But sometimes the closest dealer might not fall within the same group so to avoid any pitfalls, I search not only the group that matches but one above and below too. View Currently Used Code
Please let me know what would be the best way to optimize this or is there an easier way to find closest dealers for a large dataset like this. Any direction is greatly appreciated. Thank you!
col_names = ["CustomerKey","DealerKey","Dist"]
df = pd.DataFrame(columns = col_names)
c = 0
for i in range(0,len(df_c)):
print(i)
row = {'CustomerKey':df_c.loc[i,'ZIPBRANDKEY'],'DealerKey':'','Dist':0}
df = df.append(row, ignore_index=True)
a = group[group['LatLongGroup'] == df_c.LatLongGroup[i]].index[0]
if(a-1 >= 0):
start = group.loc[a-1,'Start']
else:
start = group.loc[a,'Start']
if(a+1 < len(group)):
end = group.loc[a+1,'End']
else:
end = group.loc[a,'End']
t1 = 0
for j in range(start,end):
dist = round(geopy.distance.distance(df_c.Lat_long[i], df_s.Lat_long[j]).miles,2)
if(t1 == 0):
min_dist = dist
dealerkey = df_s.loc[j,'DEALER_BRAND_KEY']
t1 = 1
elif(dist < min_dist):
min_dist = dist
dealerkey = df_s.loc[j,'DEALER_BRAND_KEY']
df.loc[c,'DealerKey'] = dealerkey
df.loc[c,'Dist'] = min_dist
c = c+1
df.head()
For reference, the above mentioned group dataframe looks like this:
Group | Start | End |
---|---|---|
-138 | 0 | 7 |
-137 | 7 | 15 |
-136 | 15 | 53 |
-135 | 53 | 55 |
-88 | 55 | 78 |
Generate sample data:
import pandas as pd
N = 25000
dealers = pd.DataFrame({"DealerID": "Dealer" + pd.RangeIndex(1, N+1).astype(str),
"Lat": np.random.uniform(30, 65, N),
"Long": np.random.uniform(-150, -70, N)}
).set_index("DealerID")
N = 200000
customers = pd.DataFrame({"CustomerID": "Customer" + pd.RangeIndex(1, N+1).astype(str),
"Lat": np.random.uniform(30, 65, N),
"Long": np.random.uniform(-150, -70, N)}
).set_index("CustomerID")
>>> dealers
Lat Long
DealerID
Dealer1 53.923040 -96.238974
Dealer2 33.375229 -136.379545
Dealer3 30.635395 -107.639308
Dealer4 50.264205 -97.563283
Dealer5 52.366663 -130.242301
... ... ...
Dealer24996 62.369789 -140.430366
Dealer24997 43.079035 -126.496873
Dealer24998 43.858461 -97.471257
Dealer24999 34.433920 -135.038754
Dealer25000 61.967902 -95.496924
[25000 rows x 2 columns]
>>> customers
Lat Long
CustomerID
Customer1 30.748900 -133.231319
Customer2 38.636134 -98.618844
Customer3 60.282135 -97.100096
Customer4 42.995473 -120.135218
Customer5 50.809563 -80.662491
... ... ...
Customer199996 47.387618 -88.420528
Customer199997 53.618939 -124.432385
Customer199998 58.506937 -146.024708
Customer199999 48.329325 -129.149631
Customer200000 36.599969 -145.019091
[200000 rows x 2 columns]
You can use KDTree from Scipy:
from scipy.spatial import KDTree
distances, indices = KDTree(dealers).query(customers)
Few seconds later:
>>> customers.assign(ClosestDealer=dealers.iloc[indices].index, Distance=distances)
Lat Long ClosestDealer Distance
CustomerID
Customer1 30.748900 -133.231319 Dealer22102 0.189255
Customer2 38.636134 -98.618844 Dealer1510 0.282966
Customer3 60.282135 -97.100096 Dealer2715 0.182832
Customer4 42.995473 -120.135218 Dealer10539 0.423006
Customer5 50.809563 -80.662491 Dealer12022 0.091765
... ... ... ... ...
Customer199996 47.387618 -88.420528 Dealer17124 0.325962
Customer199997 53.618939 -124.432385 Dealer9177 0.133110
Customer199998 58.506937 -146.024708 Dealer15558 0.299639
Customer199999 48.329325 -129.149631 Dealer18371 0.023172
Customer200000 36.599969 -145.019091 Dealer2316 0.199344
[200000 rows x 4 columns]