python pandas geopy

How do I speed up geopy calculations for a pandas DataFrame with approx. 100k entries?


I have a pandas DataFrame called "orders" with approx. 100k entries containing address data (zip, city, country). For each entry, I would like to calculate the distance to a specific predefined address.

So far, I'm looping over the dataframe rows with a for-loop and using geopy to 1. get latitude and longitude values for each entry and 2. calculate the distance to my predefined address.

Although this works, it takes an awful lot of time (over 15 hours with an average of 2 iterations / second) and I assume that I haven't found the most efficient way yet. Although I did quite a lot of research and tried out different things like vectorization, these alternatives did not seem to speed up the process (maybe because I didn't implement them in the correct way, as I'm not a very experienced Python user).

This is my code so far:

def get_geographic_information():

    latitude = destination_geocode.latitude
    
    longitude = destination_geocode.longitude

    destination_coordinates = (latitude, longitude)

    distance = round(geopy.distance.distance(starting_point_coordinates, destination_coordinates).km, 2)
    
    return latitude, longitude, distance

import geopy
import geopy.distance
from geopy.geocoders import Nominatim
from tqdm import tqdm  # progress bar used in the loop below

orders["Latitude"] = ""
orders["Longitude"] = ""
orders["Distance"] = ""

geolocator = Nominatim(user_agent="Project01")

starting_point = "my_address"
starting_point_geocode = geolocator.geocode(starting_point, timeout=10000)
starting_point_coordinates = (starting_point_geocode.latitude, starting_point_geocode.longitude)

for index in tqdm(range(len(orders))):
    destination_zip = orders.loc[index, "ZIP"]
    destination_city = orders.loc[index, "City"]
    destination_country = orders.loc[index, "Country"]
        
    destination = destination_zip + " " + destination_city + " " + destination_country
    destination_geocode = geolocator.geocode(destination, timeout=15000)
    
    if destination_geocode is not None:
        geographic_information = get_geographic_information()
        
        orders.loc[index, "Latitude"] = geographic_information[0]
        
        orders.loc[index, "Longitude"] = geographic_information[1]
        
        orders.loc[index, "Distance"] = geographic_information[2]
    
    else:
        orders.loc[index, "Latitude"] = "-"
        
        orders.loc[index, "Longitude"] = "-"
        
        orders.loc[index, "Distance"] = "-"

From my previous research, I learned that the for-loop might be the problem, but I haven't managed to replace it yet. As this is my first question here, I'd appreciate any constructive feedback. Thanks in advance!


Solution

  • The speed of your script is most likely limited by Nominatim itself, not by the for-loop. Its usage policy caps traffic at an absolute maximum of 1 request per second:

    https://operations.osmfoundation.org/policies/nominatim/

    The only way to speed this script up would be to find a different service that allows bulk requests. Geopy has a list of geocoding services that it currently supports. Your best bet would be to look through this list and see if you find a service that handles bulk requests (e.g. Google V3). That would allow you either to make requests in batches or to use a distributed process to speed things up.
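
To make the "distributed process" idea a bit more concrete, here is a minimal sketch using only the standard library. It is not geopy code: `geocode_unique`, `fake_geocode`, and the coordinates in `FAKE_COORDINATES` are made up for illustration, and `geocode` stands for whatever callable wraps the service you end up choosing. The sketch also adds one thing the answer above does not mention: it geocodes each distinct address only once, which may cut the number of requests if many orders share a ZIP and city.

```python
from concurrent.futures import ThreadPoolExecutor


def geocode_unique(addresses, geocode, max_workers=4):
    """Geocode each distinct address once and map results back to every row.

    `geocode` is any callable that takes an address string and returns a
    (latitude, longitude) tuple or None. It is a placeholder for the service
    you choose, not a geopy function.
    """
    # De-duplicate while keeping first-seen order.
    unique = list(dict.fromkeys(addresses))
    # Run the lookups concurrently; only do this if the service permits it.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(geocode, unique))
    lookup = dict(zip(unique, results))
    # One result per input address, in the original order.
    return [lookup[address] for address in addresses]


# Stand-in geocoder with illustrative coordinates so the sketch runs offline.
FAKE_COORDINATES = {
    "10115 Berlin Germany": (52.53, 13.38),
    "80331 Munich Germany": (48.14, 11.57),
}
calls = []  # records which addresses were actually looked up


def fake_geocode(address):
    calls.append(address)
    return FAKE_COORDINATES.get(address)


addresses = [
    "10115 Berlin Germany",
    "80331 Munich Germany",
    "10115 Berlin Germany",  # duplicate: should not trigger a second lookup
    "00000 Nowhere",         # unknown address: result is None
]
coordinates = geocode_unique(addresses, fake_geocode)
```

With your DataFrame you would build `addresses` from the ZIP, City, and Country columns and assign the returned list back as a new column. If you stay on the public Nominatim server, keep `max_workers=1` and throttle the calls (geopy ships a `RateLimiter` helper in `geopy.extra.rate_limiter` for that), since parallel requests would go against the policy linked above.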