Is there a faster way (in Python, on a CPU) of doing the same thing as the function below? I've used for loops and if statements and am wondering if there is a faster way. It currently takes about 1 minute per 100 postcodes to run this function, and I have about 70,000 to get through.
The 2 dataframes used are:
postcode_df, which contains 71,092 rows and the columns Postcode, Latitude and Longitude, e.g.
postcode_df = pd.DataFrame({"Postcode":["SK12 2LH", "SK7 6LQ"],
"Latitude":[53.362549, 53.373812],
"Longitude":[-2.061329, -2.120956]})
air, which contains 421 rows and the columns TubeRef, Latitude and Longitude, e.g.
air = pd.DataFrame({"TubeRef":["Stkprt35", "Stkprt07", "Stkprt33"],
"Latitude":[53.365085, 53.379502, 53.407510],
"Longitude":[-2.0763, -2.120777, -2.145632]})
The function loops through each postcode in postcode_df, and for each postcode loops through each TubeRef, calculates (using geopy) the distance between them, and saves the TubeRef with the shortest distance to the postcode.
The output df, postcode_nearest_tube_refs, contains the nearest tube per postcode, with the columns Postcode, Nearest Air Tube and Distance to Air Tube KM.
# define function to get nearest air quality monitoring tube per postcode
def get_nearest_tubes(constituency_list):
    postcodes = []
    nearest_tubes = []
    distances_to_tubes = []
    for postcode in postcode_df["Postcode"]:
        closest_tube = ""
        shortest_dist = 500
        postcode_lat = postcode_df.loc[postcode_df["Postcode"]==postcode, "Latitude"]
        postcode_long = postcode_df.loc[postcode_df["Postcode"]==postcode, "Longitude"]
        postcode_coord = (float(postcode_lat), float(postcode_long))
        for tuberef in air["TubeRef"]:
            tube_lat = air.loc[air["TubeRef"]==tuberef, "Latitude"]
            tube_long = air.loc[air["TubeRef"]==tuberef, "Longitude"]
            tube_coord = (float(tube_lat), float(tube_long))
            # calculate distance between postcode and tube
            dist_to_tube = geopy.distance.distance(postcode_coord, tube_coord).km
            if dist_to_tube < shortest_dist:
                shortest_dist = dist_to_tube
                closest_tube = str(tuberef)
        # save postcode's tuberef with shortest distance
        postcodes.append(str(postcode))
        nearest_tubes.append(str(closest_tube))
        distances_to_tubes.append(shortest_dist)
    # create dataframe of the postcodes, nearest tuberefs and distance
    postcode_nearest_tube_refs = pd.DataFrame({"Postcode":postcodes,
                                               "Nearest Air Tube":nearest_tubes,
                                               "Distance to Air Tube KM": distances_to_tubes})
    return postcode_nearest_tube_refs
Libraries I'm using are:
import numpy as np
import pandas as pd
# !pip install geopy
import geopy.distance
Here is a working example that runs in seconds (<10).
Import libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
import uuid
I generate some random data. This takes a second as well, but at least we have realistic amounts.
np_rand_post = 5 * np.random.random((72000,2))
np_rand_post = np_rand_post + np.array((53.577653, -2.434136))
and use UUID for fake postcodes
postcode_df = pd.DataFrame( np_rand_post , columns=['lat', 'long'])
postcode_df['postcode'] = [uuid.uuid4().hex[:6] for _ in range(72000)]
postcode_df.head()
We do the same for the air tubes
np_rand = 5 * np.random.random((500,2))
np_rand = np_rand + np.array((53.55108, -2.396236))
and again use uuid for fake ref
tube_df = pd.DataFrame( np_rand , columns=['lat', 'long'])
tube_df['ref'] = [uuid.uuid4().hex[:5] for _ in range(500)]
tube_df.head()
extract gps values as numpy
postcode_gps = postcode_df[["lat", "long"]].values
air_gps = tube_df[["lat", "long"]].values
create a balltree
postal_radians = np.radians(postcode_gps)
air_radians = np.radians(air_gps)
tree = BallTree(air_radians, leaf_size=15, metric='haversine')
query for closest first
distance, index = tree.query(postal_radians, k=1)
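To see what the query returns, here is a small sketch using the three tubes and two postcodes from the question. The names ref, qry and small_tree are made up for this illustration. Both returned arrays have one row per query point and one column per neighbour, which is why the first column is selected later on.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Three reference points (tubes) and two query points (postcodes),
# given as [lat, long] and converted to radians for the haversine metric
ref = np.radians([[53.365085, -2.0763],
                  [53.379502, -2.120777],
                  [53.407510, -2.145632]])
qry = np.radians([[53.362549, -2.061329],
                  [53.373812, -2.120956]])

small_tree = BallTree(ref, metric="haversine")
d, i = small_tree.query(qry, k=1)

print(d.shape, i.shape)  # one row per query point, one column per neighbour
print(i[:, 0])           # row positions into `ref` of each nearest point
```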
Note that the distance is not in km: with the haversine metric it is returned in radians, so you need to multiply by the Earth's radius to convert it.
earth_radius = 6371000
distance_in_meters = distance * earth_radius
distance_in_meters
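If you want to sanity-check the units, a brute-force haversine in plain NumPy should give the same numbers. This is only a sketch of a different technique (a full pairwise matrix, not a tree); haversine_km and EARTH_RADIUS_KM are names I made up here, and the coordinates are the sample rows from the question.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in km


def haversine_km(points_a, points_b):
    """Pairwise great-circle distances in km.

    points_a: (n, 2) array of [lat, long] in degrees
    points_b: (m, 2) array of [lat, long] in degrees
    returns:  (n, m) array of distances
    """
    a = np.radians(np.asarray(points_a, dtype=float))[:, None, :]  # (n, 1, 2)
    b = np.radians(np.asarray(points_b, dtype=float))[None, :, :]  # (1, m, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))


# Sample coordinates from the question
postcodes = np.array([[53.362549, -2.061329], [53.373812, -2.120956]])
tubes = np.array([[53.365085, -2.0763],
                  [53.379502, -2.120777],
                  [53.407510, -2.145632]])

dist_km = haversine_km(postcodes, tubes)  # shape (2, 3)
nearest_idx = dist_km.argmin(axis=1)      # position of the closest tube per postcode
nearest_km = dist_km.min(axis=1)          # distance to that tube in km
```

For the full data this builds a 71,092 x 421 matrix, so the intermediate arrays get large; processing the postcodes in chunks would keep memory down.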
And, for instance, get the ref of each nearest tube with tube_df.ref[ index[:,0] ]
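Putting it together with the column names from the question, the output dataframe could be built roughly like this. It is a sketch, not a drop-in replacement: nearest_tubes_balltree is a hypothetical helper name, and haversine treats the Earth as a sphere, so the distances can differ by a fraction of a percent from geopy's default geodesic distance.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, in km

# Sample data from the question
postcode_df = pd.DataFrame({"Postcode": ["SK12 2LH", "SK7 6LQ"],
                            "Latitude": [53.362549, 53.373812],
                            "Longitude": [-2.061329, -2.120956]})
air = pd.DataFrame({"TubeRef": ["Stkprt35", "Stkprt07", "Stkprt33"],
                    "Latitude": [53.365085, 53.379502, 53.407510],
                    "Longitude": [-2.0763, -2.120777, -2.145632]})


def nearest_tubes_balltree(postcode_df, air):
    """Return the nearest tube and its distance in km for every postcode."""
    # haversine expects [lat, long] in radians
    tube_rad = np.radians(air[["Latitude", "Longitude"]].to_numpy())
    post_rad = np.radians(postcode_df[["Latitude", "Longitude"]].to_numpy())

    tree = BallTree(tube_rad, metric="haversine")
    dist_rad, idx = tree.query(post_rad, k=1)  # both have shape (n_postcodes, 1)

    return pd.DataFrame({
        "Postcode": postcode_df["Postcode"].to_numpy(),
        "Nearest Air Tube": air["TubeRef"].to_numpy()[idx[:, 0]],
        "Distance to Air Tube KM": dist_rad[:, 0] * EARTH_RADIUS_KM,
    })


postcode_nearest_tube_refs = nearest_tubes_balltree(postcode_df, air)
print(postcode_nearest_tube_refs)
```

Indexing the TubeRef values by position (via to_numpy) avoids depending on the dataframe having a default 0..n-1 index.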