So this may seem like an odd question, but I have a pandas DataFrame with addresses in it that I want to geocode so I can get the latitude and longitude.
I have code that works using .apply()
thanks to this very helpful thread (new column with coordinates using geopy pandas), but my problem is that all of the open APIs have strict limits on how many requests per second they allow, and on how many requests per day.
I haven't been able to find any way to throttle my code to match the limits of the APIs. My DF has 25K rows, but I've only been able to geocode successfully if I create a subset of it with up to 5 rows.
I don't have a lot of experience with python and pandas, but in SAS the DATA steps iterate one row at a time, so I could have a sleep command that would throttle the requests. What would be the best way to implement something like that with python/pandas?
EDIT: Based on the answers so far, I wanted to confirm that my code would change from:
df_small['city_coord'] = df_small['Address'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
to:
df_small = df_clean[:5]
from time import sleep

def f(x, delay=1):
    sleep(delay)  # wait before each request to stay under the rate limit
    return geolocator.geocode(x)
df_small['city_coord'] = df_small['Address'].apply(f).apply(lambda x: (x.latitude, x.longitude))
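(One wrinkle I want to double-check while I'm at it: as I understand it, geocode returns None for addresses it can't match, which would make the lambda raise AttributeError on those rows. So I'm guessing a guarded version like this would be safer; the None check is my own addition, not from the thread above:)

df_small['city_coord'] = (
    df_small['Address']
    .apply(f)
    .apply(lambda loc: (loc.latitude, loc.longitude) if loc else None)
)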
To iterate with a delay, you can use df.iterrows() and time.sleep():
from time import sleep
for index, row in df.iterrows():
    # run your code here, one row at a time
    sleep(1)  # how many seconds to wait
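A fuller sketch of that loop, collecting the results into a new column (this assumes the geolocator object and 'Address' column from your question, and guards against geocode returning None for unmatched addresses):

coords = []
for index, row in df_small.iterrows():
    location = geolocator.geocode(row['Address'])
    # geocode returns None when no match is found
    coords.append((location.latitude, location.longitude) if location else None)
    sleep(1)  # throttle to one request per second
df_small['city_coord'] = coords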
Or you can just put time.sleep() within the apply function itself (as @RafaelC suggests in the comments):
def f(x, delay=1):
    # run your code (e.g. return geolocator.geocode(x))
    sleep(delay)

df['Address'].apply(f)
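For completeness: recent versions of geopy (1.16+) ship a RateLimiter helper built for exactly this, including retries on transient errors, so you don't have to hand-roll the delay. A minimal sketch, assuming the geolocator and 'Address' column from your question:

from geopy.extra.rate_limiter import RateLimiter

# wrap the geocoder so consecutive calls are at least 1 second apart
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

df_small['city_coord'] = df_small['Address'].apply(geocode).apply(
    lambda loc: (loc.latitude, loc.longitude) if loc else None  # same None guard as in your edit
)

Note that a delay of 1 second per request still means roughly 7 hours for 25K rows, so you may also want to cache results or batch the work across days to respect the daily limits you mentioned.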