
How should I go about geolocating 1,100,000 lines of coordinate information?


Okay, so I'm trying to envision a solution for this. I have a database with over a million lines which includes a city name in the US and a set of coordinates for that city. The problem is that there are multiple cities with the same name: Springfield, NJ and Springfield, MA, for example. So I need to get the state information.

There are also duplicates within the data. There are only about 6500 sets of unique coordinates, so conceivably, I could locate those and then assign them to the other entries in the database. Is this a feasible plan? How would I go about this?

Here are some examples of what entries in this database look like:

2015-09-01 00:00:00,Buffalo,"42.9405299,-78.8697906",10.1016/s0894-7317(12)00840-1,42.9405299,-78.8697906,43.0,-79.0
2015-09-01 00:00:00,New York,"40.7830603,-73.9712488",10.1016/j.jmv.2014.04.008,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:04,Scottsdale,"33.4941704,-111.9260519",10.1016/j.euroneuro.2014.05.008,33.494170399999994,-111.9260519,33.0,-112.0
2015-09-01 00:00:09,Provo,"40.2338438,-111.6585337",10.1016/j.toxac.2014.07.002,40.233843799999995,-111.6585337,40.0,-112.0
2015-09-01 00:00:13,New York,"40.7830603,-73.9712488",10.1016/j.drugalcdep.2014.09.015,40.783060299999995,-73.9712488,41.0,-74.0
2015-09-01 00:00:16,Fremont,"37.5482697,-121.9885719",10.1016/j.ajic.2012.04.160,37.548269700000006,-121.98857190000001,38.0,-122.0
2015-09-01 00:00:24,Provo,"40.2338438,-111.6585337",10.1016/j.chroma.2015.01.036,40.233843799999995,-111.6585337,40.0,-112.0

I am using the geocoder package for reverse geolocation. Here is some code I've written that could handle that:

import geocoder

def convert_to_state(lati, long):
    # Reverse-geocode a (lat, lon) pair into its US state
    lat, lon = float(lati), float(long)
    g = geocoder.google([lat, lon], method='reverse')
    state_long, state_short = g.state_long, g.state
    return state_long, state_short

I'm just not sure how to do this at scale. It turns out geocoding is pretty expensive, so exploiting the duplicates is probably the best way forward. Any suggestions for how to accomplish that?


Solution

  • I like the hash table idea, but here is an alternative using some pandas stuff:

    1) get a unique list of (lat, lon) coords

    df['latlon'] = list(zip(df['lati'], df['long']))
    unique_ll = df['latlon'].unique()
    

    2) loop through unique coords and set the state for all equivalent lines

    for ll in unique_ll:
        # convert_to_state returns a (state_long, state_short) tuple, so
        # unpack it and assign a single value to the matching rows
        state_long, state_short = convert_to_state(ll[0], ll[1])
        df.loc[df['latlon'] == ll, 'state'] = state_short
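
    Since there are only ~6,500 unique coordinates against 1.1M rows, you can also skip the boolean-mask scan per coordinate: geocode each unique pair once into a dict, then spread the results with `Series.map`. A minimal sketch, where the hypothetical `STATE_LOOKUP` dict stands in for real `convert_to_state` calls (a real run would call the geocoder once per unique pair instead):

    ```python
    import pandas as pd

    # Hypothetical stand-in for one geocoder call per unique coordinate
    STATE_LOOKUP = {
        (42.9405299, -78.8697906): "NY",
        (40.7830603, -73.9712488): "NY",
        (40.2338438, -111.6585337): "UT",
    }

    df = pd.DataFrame({
        "city": ["Buffalo", "New York", "Provo", "New York"],
        "lati": [42.9405299, 40.7830603, 40.2338438, 40.7830603],
        "long": [-78.8697906, -73.9712488, -111.6585337, -73.9712488],
    })

    # Resolve each unique coordinate exactly once, then map the result
    # onto every duplicate row in a single vectorized pass
    df["latlon"] = list(zip(df["lati"], df["long"]))
    states = {ll: STATE_LOOKUP[ll] for ll in df["latlon"].unique()}
    df["state"] = df["latlon"].map(states)

    print(df["state"].tolist())  # ['NY', 'NY', 'UT', 'NY']
    ```

    The `.map()` call is a single O(n) pass over the frame, versus one full-column comparison per unique coordinate in the loop above, which matters at a million rows.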