I have two sets of lats and longs that I wish to join with a Cartesian join and find the distance between each pair. There can be duplicates in number or other_number (i.e. two locations/addresses per identifier).
import pandas as pd
d = {'number': ['100', '101'], 'lat': ['40.6892', '41.8902'], 'long': ['74.0445','12.4922']}
d2 = {'other_number': ['200', '201'], 'lat': ['37.8199', '43.8791'], 'long': ['122.4783','103.4591']}
data = pd.DataFrame(data=d)
data2 = pd.DataFrame(data=d2)
I am currently turning the lat/long fields into lists of tuples...
tuple_list_1 = list(zip(data.lat.astype(float), data.long.astype(float)))
tuple_list_2 = list(zip(data2.lat.astype(float), data2.long.astype(float)))
...and then performing the Cartesian join with a generator.
gen = ([x, y] for x in tuple_list_1 for y in tuple_list_2)
Finally, I am finding the distance with a simple loop:
from geopy.distance import geodesic
for u, v in gen:
    dist = geodesic(u, v).miles
    print(dist)
Ultimately, I would like the distance tied back to the original information (i.e. number and other_number). This is my desired result:
d3 = {'number': ['100', '100', '101', '101'],
      'address': ['Statue of Liberty', 'Statue of Liberty', 'Colosseum', 'Colosseum'],
      'other_number': ['200', '201', '200', '201'],
      'other_address': ['Golden Gate Bridge', 'Mount Rushmore', 'Golden Gate Bridge', 'Mount Rushmore'],
      'distance': [2572.262967759492, 1515.3455804766047, 5400.249562015358, 4365.4386483486205]
      }
data3 = pd.DataFrame(data=d3)
How do I retrieve the distance efficiently (I'm thinking looping through the generator may not be that efficient), and tie the results back to the identifying fields in a final DataFrame?
import pandas as pd
d = {'number': ['100', '101'], 'lat': ['40.6892', '41.8902'], 'long': ['74.0445','12.4922']}
d2 = {'other_number': ['200', '201'], 'lat': ['37.8199', '43.8791'], 'long': ['122.4783','103.4591']}
data = pd.DataFrame(data=d)
data2 = pd.DataFrame(data=d2)
# Perform cartesian product
data['key'] = 0
data2['key'] = 0
df = pd.merge(data, data2, on='key', how='outer')
df = df.drop('key', axis=1)
# Calculate distance
from geopy.distance import geodesic
df['distance'] = df.apply(lambda row: geodesic((row['lat_x'], row['long_x']), (row['lat_y'], row['long_y'])).miles, axis=1)
df would look like this:
   number    lat_x   long_x  other_number    lat_y    long_y     distance
0     100  40.6892  74.0445           200  37.8199  122.4783  2572.262968
1     100  40.6892  74.0445           201  43.8791  103.4591  1515.345580
2     101  41.8902  12.4922           200  37.8199  122.4783  5400.249562
3     101  41.8902  12.4922           201  43.8791  103.4591  4365.438648
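Note that apply with axis=1 still calls geodesic once per row in Python, so it is not much faster than the original loop. If an approximation is acceptable, the distances can be computed for all pairs at once with NumPy broadcasting. The sketch below swaps in the haversine (great-circle) formula on a spherical Earth instead of geopy's ellipsoidal geodesic, so results differ slightly (roughly 0.2 to 0.3% on this sample data). The helper name haversine_matrix_miles and the radius constant are my own assumptions, not part of geopy or pandas.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean radius; assumes a spherical Earth


def haversine_matrix_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between every point in set 1 and every point in set 2.

    Inputs are 1-D arrays in degrees; the result has shape (len(set 1), len(set 2)).
    """
    # Column vector for set 1, row vector for set 2, so broadcasting forms all pairs.
    lat1 = np.radians(lat1)[:, None]
    lon1 = np.radians(lon1)[:, None]
    lat2 = np.radians(lat2)[None, :]
    lon2 = np.radians(lon2)[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


data = pd.DataFrame({'number': ['100', '101'],
                     'lat': ['40.6892', '41.8902'],
                     'long': ['74.0445', '12.4922']})
data2 = pd.DataFrame({'other_number': ['200', '201'],
                      'lat': ['37.8199', '43.8791'],
                      'long': ['122.4783', '103.4591']})

dist = haversine_matrix_miles(
    data['lat'].astype(float).to_numpy(), data['long'].astype(float).to_numpy(),
    data2['lat'].astype(float).to_numpy(), data2['long'].astype(float).to_numpy(),
)

# Flatten row by row: repeat each left identifier, tile the right identifiers.
result = pd.DataFrame({
    'number': np.repeat(data['number'].to_numpy(), len(data2)),
    'other_number': np.tile(data2['other_number'].to_numpy(), len(data)),
    'distance': dist.ravel(),
})
print(result)
```

If you need the exact geodesic values, keep the geopy version; the broadcasting approach trades a little accuracy for speed on large inputs.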
There are other ways to perform a Cartesian product in pandas if you don't like adding a temporary key column; see "cartesian product in pandas".
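One such alternative, assuming pandas 1.2 or newer, is the how='cross' option of merge, which builds the Cartesian product directly without a helper column. A minimal sketch using the same sample data:

```python
import pandas as pd

data = pd.DataFrame({'number': ['100', '101'],
                     'lat': ['40.6892', '41.8902'],
                     'long': ['74.0445', '12.4922']})
data2 = pd.DataFrame({'other_number': ['200', '201'],
                      'lat': ['37.8199', '43.8791'],
                      'long': ['122.4783', '103.4591']})

# how='cross' pairs every row of data with every row of data2 (requires pandas >= 1.2).
# Overlapping column names get the default _x / _y suffixes.
pairs = data.merge(data2, how='cross')
print(pairs)
```

The resulting columns (lat_x, long_x, lat_y, long_y) match those used in the distance calculation above, so the rest of the code can stay unchanged.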