Is there a faster way than for loops and if statements to find nearest point to another point in python?

Is there a faster way (in Python, with a CPU) of doing the same thing as the function below? I've used For loops and if statements and wondering if there is a faster way? It currently takes about 1 minute per 100 postcodes to run this function, and I have about 70,000 to get through.

The 2 dataframes used are:

postcode_df which contains 71,092 rows, and columns:

Postcode e.g. "BL4 7PD"
Latitude e.g. 53.577653
Longitude e.g. -2.434136

e.g.

postcode_df = pd.DataFrame({"Postcode":["SK12 2LH", "SK7 6LQ"],
                                    "Latitude":[53.362549, 53.373812],
                                    "Longitude":[-2.061329, -2.120956]})

air which contains 421 rows, and columns:

TubeRef e.g. "ABC01"
Latitude e.g. 53.55108
Longitude e.g. -2.396236

e.g.

air = pd.DataFrame({"TubeRef":["Stkprt35", "Stkprt07", "Stkprt33"],
                                    "Latitude":[53.365085, 53.379502, 53.407510],
                                    "Longitude":[-2.0763, -2.120777, -2.145632]})

The function loops through each postcode in postcode_df, and for each postcode loops through each TubeRef and calculates (using geopy) the distance between them and saves the TubeRef with the shortest distance to the postcode.

The output df, postcode_nearest_tube_refs, contains the nearest tube per postcode and contains columns:

Postcode e.g. "BL4 7PD"
Nearest Air Tube e.g. "ABC01
Distance to Air Tube KM e.g. 1.035848

# define function to get nearest air quality monitoring tube per postcode
def get_nearest_tubes(constituency_list):
    
    postcodes = []
    nearest_tubes = []
    distances_to_tubes = []
    
    for postcode in postcode_df["Postcode"]:
            closest_tube = ""
            shortest_dist = 500

            postcode_lat = postcode_df.loc[postcode_df["Postcode"]==postcode, "Latitude"]
            postcode_long = postcode_df.loc[postcode_df["Postcode"]==postcode, "Longitude"]
            postcode_coord = (float(postcode_lat), float(postcode_long))


            for tuberef in air["TubeRef"]:
                tube_lat = air.loc[air["TubeRef"]==tuberef, "Latitude"]
                tube_long = air.loc[air["TubeRef"]==tuberef, "Longitude"]
                tube_coord = (float(tube_lat), float(tube_long))

                # calculate distance between postcode and tube
                dist_to_tube = geopy.distance.distance(postcode_coord, tube_coord).km
                if dist_to_tube < shortest_dist:
                    shortest_dist = dist_to_tube
                    closest_tube = str(tuberef)

            # save postcode's tuberef with shortest distance
            postcodes.append(str(postcode))
            nearest_tubes.append(str(closest_tube))
            distances_to_tubes.append(shortest_dist)
            
    # create dataframe of the postcodes, nearest tuberefs and distance
    postcode_nearest_tube_refs = pd.DataFrame({"Postcode":postcodes, 
                                          "Nearest Air Tube":nearest_tubes, 
                                          "Distance to Air Tube KM": distances_to_tubes})

    return postcode_nearest_tube_refs

Libraries I'm using are:

import numpy as np
import pandas as pd
# !pip install geopy
import geopy.distance

Solution

An working example here, taking seconds (<10).

Import libraries

import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
import uuid

I generate some random data, this takes a second as well, but at least we have some realistic amounts.

np_rand_post = 5 * np.random.random((72000,2))
np_rand_post = np_rand_post + np.array((53.577653, -2.434136))

and use UUID for fake postcodes

postcode_df = pd.DataFrame( np_rand_post , columns=['lat', 'long'])
postcode_df['postcode'] = [uuid.uuid4().hex[:6] for _ in range(72000)]
postcode_df.head()

We do the same for the air

np_rand = 5 * np.random.random((500,2))
np_rand = np_rand + np.array((53.55108, -2.396236))

and again use uuid for fake ref

tube_df = pd.DataFrame( np_rand , columns=['lat', 'long'])
tube_df['ref'] = [uuid.uuid4().hex[:5] for _ in range(500)]
tube_df.head()

extract gps values as numpy

postcode_gps = postcode_df[["lat", "long"]].values
air_gps = tube_df[["lat", "long"]].values

create a balltree

postal_radians =  np.radians(postcode_gps)
air_radians = np.radians(air_gps)

tree = BallTree(air_radians, leaf_size=15, metric='haversine')

query for closest first

distance, index = tree.query(postal_radians, k=1)

Note that the distance is not in KM, you need to convert first.

earth_radius = 6371000

distance_in_meters = distance * earth_radius
distance_in_meters

And for instance get the ref with tube_df.ref[ index[:,0] ]