
Iterate geolocation over pandas dataframe


I have a dataframe with two columns, Hospital Name and Address, and I want to iterate over the addresses to find the latitude and longitude of each one. My code seems to be taking the whole first row of the dataframe, and I can't work out how to select just the address to look up the coordinates.

import pandas
from geopy.geocoders import Nominatim

geolocator = Nominatim()
for index, item in df.iterrows():
    location = geolocator.geocode(item)
    df["Latitude"].append(location.latitude)
    df["Longitude"].append(location.longitude)

Here is the code I used to scrape the website. Copy and run this and you'll have the data set.

import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np

r=requests.get("https://www.privatehealth.co.uk/hospitals-and-clinics/orthopaedic-surgery/?offset=300")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"col-9"})
names = []
for item in all:
    d={}
    d["Hospital Name"] = item.find(["h3"],{"class":"mb6"}).text.replace("\n","")
    d["Address"] = item.find(["p"],{"class":"mb6"}).text.replace("\n","")
    names.append(d)
df=pandas.DataFrame(names)
df = df[['Hospital Name','Address']]
df

Currently the data looks like this (one hospital shown as an example):

Hospital Name   |Address         
Fulwood Hospital|Preston, PR2 9SZ

The final output that I'm trying to achieve looks like this:

Hospital Name   |Address         |Latitude  |Longitude
Fulwood Hospital|Preston, PR2 9SZ|53.7589938|-2.7051618

Solution

  • Seems like there are a few issues here. Using data from the URL you provided:

    df.head()
                                    Hospital Name                   Address
    0                        Fortius Clinic City           London, EC4N 7BE
    1  Pinehill Hospital - Ramsay Health Care UK           Hitchin, SG4 9QZ
    2                  Spire Montefiore Hospital              Hove, BN3 1RD
    3             Chelsea & Westminster Hospital           London, SW10 9NH
    4   Nuffield Health Tunbridge Wells Hospital   Tunbridge Wells, TN2 4UL
    

    (1) If your data frame column names really are Hospital Name and Address, then you need to pass item.Address to geocode().
    Passing item on its own hands geocode() the whole row, that is, both Hospital Name and Address.

    for index, item in df.iterrows():
        print(f"index: {index}")
        print(f"item: {item}")
        print(f"item.Address only: {item.Address}")
    
    # Output:
    index: 0
    
    item: Hospital Name    Fortius Clinic City 
    Address              London, EC4N 7BE
    Name: 0, dtype: object
    
    item.Address only: London, EC4N 7BE
    ...
    

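    (2) You noted that your data frame only has two columns. If that's true, you'll get a KeyError when you try to perform operations on df["Latitude"] and df["Longitude"], because those columns don't exist yet. Even if they did, Series.append() returns a new Series rather than modifying the column in place (and it was removed in pandas 2.0), so the loop would not store anything.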
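To make points (1) and (2) concrete, here is a minimal sketch of the loop with both fixes applied: only the address is passed to the geocoder, and the results are collected in plain lists that are assigned as new columns afterwards. So that it runs offline, it uses fake_geocode, a hypothetical stand-in for geolocator.geocode that only knows the Fulwood Hospital coordinates quoted in the question. The helper name add_coordinates is also just an illustration.

```python
from types import SimpleNamespace

import pandas as pd


def add_coordinates(df, geocode):
    """Return a copy of df with Latitude and Longitude columns added."""
    latitudes, longitudes = [], []
    for _, row in df.iterrows():
        # Point (1): pass only the address string, not the whole row.
        location = geocode(row["Address"])
        latitudes.append(location.latitude)
        longitudes.append(location.longitude)
    out = df.copy()
    # Point (2): assigning a list creates the column, so it need not exist beforehand.
    out["Latitude"] = latitudes
    out["Longitude"] = longitudes
    return out


def fake_geocode(address):
    # Hypothetical offline stand-in for geolocator.geocode; it only knows
    # the single address/coordinate pair quoted in the question.
    known = {
        "Preston, PR2 9SZ": SimpleNamespace(latitude=53.7589938, longitude=-2.7051618)
    }
    return known[address]


hospitals = pd.DataFrame(
    {"Hospital Name": ["Fulwood Hospital"], "Address": ["Preston, PR2 9SZ"]}
)
result = add_coordinates(hospitals, fake_geocode)
print(result)
```

With the real service you would pass geolocator.geocode in place of fake_geocode; the rest of the loop stays the same.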

    (3) Using apply() on the Address column might be clearer than iterrows().
    Note that this is a stylistic point, and debatable. (The first two points are actual errors.)

    For example, using the provided URL:

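    from geopy.geocoders import Nominatim
    # geopy 2.x rejects the default user agent; "hospital-geocoder" is a placeholder name.
    geolocator = Nominatim(user_agent="hospital-geocoder")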
    
    tmp = df.head().copy()
    
    latlon = tmp.Address.apply(lambda addr: geolocator.geocode(addr))
    
    tmp["Latitude"] = [x.latitude for x in latlon]
    tmp["Longitude"] = [x.longitude for x in latlon]
    

    Output:

                                    Hospital Name                   Address  \
    0                        Fortius Clinic City           London, EC4N 7BE   
    1  Pinehill Hospital - Ramsay Health Care UK           Hitchin, SG4 9QZ   
    2                  Spire Montefiore Hospital              Hove, BN3 1RD   
    3             Chelsea & Westminster Hospital           London, SW10 9NH   
    4   Nuffield Health Tunbridge Wells Hospital   Tunbridge Wells, TN2 4UL   
    
        Latitude  Longitude  
    0  51.507322  -0.127647  
    1  51.946413  -0.279165  
    2  50.840871  -0.180561  
    3  51.507322  -0.127647  
    4  51.131528   0.278068
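
One caveat with the list comprehensions above: geocode() returns None when it cannot resolve an address, and x.latitude would then raise an AttributeError. The sketch below shows one possible way to guard against that by filling in NaN. It again runs offline against fake_geocode, a hypothetical stand-in for geolocator.geocode. The second row ("Made-up Clinic") is invented purely to trigger the missing-result case, and coordinates_or_nan is an illustrative helper name.

```python
from types import SimpleNamespace

import pandas as pd


def coordinates_or_nan(location):
    """Return (latitude, longitude), or a pair of NaNs if nothing was found."""
    if location is None:
        return (float("nan"), float("nan"))
    return (location.latitude, location.longitude)


def fake_geocode(address):
    # Hypothetical offline stand-in for geolocator.geocode.
    known = {
        "Preston, PR2 9SZ": SimpleNamespace(latitude=53.7589938, longitude=-2.7051618)
    }
    return known.get(address)  # None for any address it does not know


tmp = pd.DataFrame(
    {
        "Hospital Name": ["Fulwood Hospital", "Made-up Clinic"],
        "Address": ["Preston, PR2 9SZ", "Nowhere, ZZ9 9ZZ"],
    }
)

latlon = tmp["Address"].apply(fake_geocode)
pairs = [coordinates_or_nan(x) for x in latlon]
tmp["Latitude"] = [lat for lat, _ in pairs]
tmp["Longitude"] = [lon for _, lon in pairs]
print(tmp)
```

Rows that could not be geocoded end up with NaN in both columns, so they are easy to find afterwards with tmp[tmp["Latitude"].isna()].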