Search code examples
pythonperformancegeopy

Is there a faster way than for loops and if statements to find nearest point to another point in python?


Is there a faster way (in Python, with a CPU) of doing the same thing as the function below? I've used For loops and if statements and wondering if there is a faster way? It currently takes about 1 minute per 100 postcodes to run this function, and I have about 70,000 to get through.

The 2 dataframes used are:

postcode_df which contains 71,092 rows, and columns:

  • Postcode e.g. "BL4 7PD"
  • Latitude e.g. 53.577653
  • Longitude e.g. -2.434136

e.g.

postcode_df = pd.DataFrame({"Postcode":["SK12 2LH", "SK7 6LQ"],
                                    "Latitude":[53.362549, 53.373812],
                                    "Longitude":[-2.061329, -2.120956]})

air which contains 421 rows, and columns:

  • TubeRef e.g. "ABC01"
  • Latitude e.g. 53.55108
  • Longitude e.g. -2.396236

e.g.

air = pd.DataFrame({"TubeRef":["Stkprt35", "Stkprt07", "Stkprt33"],
                                    "Latitude":[53.365085, 53.379502, 53.407510],
                                    "Longitude":[-2.0763, -2.120777, -2.145632]})

The function loops through each postcode in postcode_df, and for each postcode loops through each TubeRef and calculates (using geopy) the distance between them and saves the TubeRef with the shortest distance to the postcode.

The output df, postcode_nearest_tube_refs, contains the nearest tube per postcode and contains columns:

  • Postcode e.g. "BL4 7PD"
  • Nearest Air Tube e.g. "ABC01
  • Distance to Air Tube KM e.g. 1.035848
# define function to get nearest air quality monitoring tube per postcode
def get_nearest_tubes(constituency_list):
    
    postcodes = []
    nearest_tubes = []
    distances_to_tubes = []
    
    for postcode in postcode_df["Postcode"]:
            closest_tube = ""
            shortest_dist = 500

            postcode_lat = postcode_df.loc[postcode_df["Postcode"]==postcode, "Latitude"]
            postcode_long = postcode_df.loc[postcode_df["Postcode"]==postcode, "Longitude"]
            postcode_coord = (float(postcode_lat), float(postcode_long))


            for tuberef in air["TubeRef"]:
                tube_lat = air.loc[air["TubeRef"]==tuberef, "Latitude"]
                tube_long = air.loc[air["TubeRef"]==tuberef, "Longitude"]
                tube_coord = (float(tube_lat), float(tube_long))

                # calculate distance between postcode and tube
                dist_to_tube = geopy.distance.distance(postcode_coord, tube_coord).km
                if dist_to_tube < shortest_dist:
                    shortest_dist = dist_to_tube
                    closest_tube = str(tuberef)

            # save postcode's tuberef with shortest distance
            postcodes.append(str(postcode))
            nearest_tubes.append(str(closest_tube))
            distances_to_tubes.append(shortest_dist)
            
    # create dataframe of the postcodes, nearest tuberefs and distance
    postcode_nearest_tube_refs = pd.DataFrame({"Postcode":postcodes, 
                                          "Nearest Air Tube":nearest_tubes, 
                                          "Distance to Air Tube KM": distances_to_tubes})

    return postcode_nearest_tube_refs

Libraries I'm using are:

import numpy as np
import pandas as pd
# !pip install geopy
import geopy.distance

Solution

  • An working example here, taking seconds (<10).

    Import libraries

    import pandas as pd
    import numpy as np
    from sklearn.neighbors import BallTree
    import uuid
    

    I generate some random data, this takes a second as well, but at least we have some realistic amounts.

    np_rand_post = 5 * np.random.random((72000,2))
    np_rand_post = np_rand_post + np.array((53.577653, -2.434136))
    

    and use UUID for fake postcodes

    postcode_df = pd.DataFrame( np_rand_post , columns=['lat', 'long'])
    postcode_df['postcode'] = [uuid.uuid4().hex[:6] for _ in range(72000)]
    postcode_df.head()
    

    We do the same for the air

    np_rand = 5 * np.random.random((500,2))
    np_rand = np_rand + np.array((53.55108, -2.396236))
    

    and again use uuid for fake ref

    tube_df = pd.DataFrame( np_rand , columns=['lat', 'long'])
    tube_df['ref'] = [uuid.uuid4().hex[:5] for _ in range(500)]
    tube_df.head()
    

    extract gps values as numpy

    postcode_gps = postcode_df[["lat", "long"]].values
    air_gps = tube_df[["lat", "long"]].values
    

    create a balltree

    postal_radians =  np.radians(postcode_gps)
    air_radians = np.radians(air_gps)
    
    tree = BallTree(air_radians, leaf_size=15, metric='haversine')
    

    query for closest first

    distance, index = tree.query(postal_radians, k=1)
    

    Note that the distance is not in KM, you need to convert first.

    earth_radius = 6371000
    
    distance_in_meters = distance * earth_radius
    distance_in_meters
    

    And for instance get the ref with tube_df.ref[ index[:,0] ]