I have a pandas DataFrame with latitude, longitude, and a measure for 100K+ GPS points:
df = pd.DataFrame({'lat': [41.260637, 45.720185, 45.720189, 45.720214, 45.720227, 46.085716, 46.085718, 46.085728, 46.085730, 46.085732],
'lng': [2.825920, 3.068014, 3.068113, 3.067929, 3.068199, 3.341655, 3.341534, 3.341476, 3.341546, 3.341476],
'measure': [6.30000, -0.375000, -0.375000, -0.375000, -0.375000, 0.000000, 0.000000, 0.555556, 0.714286, 0.645833]})
What I want to do is calculate, for each of these points, the average of the measure column for all points within a range of 10 meters.
I know how to calculate the distance between two points using geopy:
from geopy.distance import distance
distance([df.lat[3], df.lng[3]], [df.lat[4], df.lng[4]]).m
21.06426497936181
But how would I go about iterating over the rows, selecting the points within the 10 m range, and averaging the measure?
I'm guessing some sort of groupby, but can't figure out how.
Note that in this example each point is always included in its own neighbourhood, so it contributes to its own average. You would need to modify that part if you want to exclude the point itself.
We can use BallTree from scikit-learn:
import pandas as pd
from sklearn.neighbors import BallTree
import numpy as np
And with your sample data
df = pd.DataFrame({'lat': [41.260637, 45.720185, 45.720189, 45.720214, 45.720227, 46.085716, 46.085718, 46.085728, 46.085730, 46.085732],
'lng': [2.825920, 3.068014, 3.068113, 3.067929, 3.068199, 3.341655, 3.341534, 3.341476, 3.341546, 3.341476],
'measure': [6.30000, -0.375000, -0.375000, -0.375000, -0.375000, 0.000000, 0.000000, 0.555556, 0.714286, 0.645833]})
We can create a tree with:
gps_pairs = df[["lat", "lng"]].values  # columns in [lat, lng] order
radians = np.radians(gps_pairs)  # the haversine metric expects radians
tree = BallTree(radians, leaf_size=15, metric='haversine')
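The haversine metric measures distances on the unit sphere, so the 10 m radius has to be scaled to radians by dividing by the Earth's radius (approximately):
distance_in_meters = 10
earth_radius = 6371000  # mean Earth radius in meters
radius = distance_in_meters / earth_radius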
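To sanity-check this conversion, here is a minimal sketch of the haversine formula in plain Python, assuming a spherical Earth with the same mean radius of 6371000 m. The helper name haversine_m is made up for this illustration and is not part of scikit-learn or geopy. It shows that a distance in meters divided by the Earth's radius is the angle (in radians) that the tree works with.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters, same approximation as above


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    central_angle = 2 * math.asin(math.sqrt(a))  # angle in radians on the unit sphere
    return central_angle * EARTH_RADIUS_M


# 10 m expressed as an angle on the unit sphere
radius = 10 / EARTH_RADIUS_M

# Rows 3 and 4 of the sample data, roughly 21 m apart
d = haversine_m(45.720214, 3.067929, 45.720227, 3.068199)
print(d, d / EARTH_RADIUS_M > radius)  # distance in meters, and whether it exceeds the radius
```

The result will differ slightly from the geopy figure in the question, since geopy's distance uses an ellipsoidal Earth model by default rather than a sphere.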
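Query the tree with that radius:
is_within, distances = tree.query_radius(radians, r=radius, count_only=False, return_distance=True)
is_within will contain, for each point, the indices of all points that fall within 10 meters of it (including the point itself); distances holds the matching distances, in radians.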
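As a quick illustration of what comes back, the sketch below rebuilds the tree on the sample data and prints the neighbours of each row. It asks for indices only (no distances), which is enough for the averaging step. Each entry of the result is its own index array, and the arrays can have different lengths.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df = pd.DataFrame({
    'lat': [41.260637, 45.720185, 45.720189, 45.720214, 45.720227,
            46.085716, 46.085718, 46.085728, 46.085730, 46.085732],
    'lng': [2.825920, 3.068014, 3.068113, 3.067929, 3.068199,
            3.341655, 3.341534, 3.341476, 3.341546, 3.341476],
    'measure': [6.30000, -0.375000, -0.375000, -0.375000, -0.375000,
                0.000000, 0.000000, 0.555556, 0.714286, 0.645833]})

radians = np.radians(df[["lat", "lng"]].values)  # [lat, lng] order, in radians
tree = BallTree(radians, leaf_size=15, metric='haversine')
radius = 10 / 6371000  # 10 m as an angle on the unit sphere

# Without return_distance, only the index arrays are returned
is_within = tree.query_radius(radians, r=radius)

for i, neighbours in enumerate(is_within):
    # Every row lists itself, because its distance to itself is zero
    print(i, sorted(neighbours.tolist()))
```

Row 0 is far from everything else in the sample, so its only neighbour is itself.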
Now you can calculate the average measure with:
measures = df['measure'].values  # 1-D array of the measure column
# For each point, average the measure over its neighbours' indices
average_measure_for_withins = np.array([np.mean(measures[withins]) for withins in is_within])
And, for instance, add this to the DataFrame:
df['average_for_withins'] = average_measure_for_withins
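If the point itself should not count towards its own average, one possible variation (a sketch, not the only way to do it) is to drop each point's own index before averaging. The helper mean_without_self and the column name average_excluding_self are made up for this example, and returning NaN for points with no other neighbour within 10 m is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df = pd.DataFrame({
    'lat': [41.260637, 45.720185, 45.720189, 45.720214, 45.720227,
            46.085716, 46.085718, 46.085728, 46.085730, 46.085732],
    'lng': [2.825920, 3.068014, 3.068113, 3.067929, 3.068199,
            3.341655, 3.341534, 3.341476, 3.341546, 3.341476],
    'measure': [6.30000, -0.375000, -0.375000, -0.375000, -0.375000,
                0.000000, 0.000000, 0.555556, 0.714286, 0.645833]})

radians = np.radians(df[["lat", "lng"]].values)
tree = BallTree(radians, leaf_size=15, metric='haversine')
is_within = tree.query_radius(radians, r=10 / 6371000)

measures = df['measure'].to_numpy()


def mean_without_self(i, neighbours):
    """Average the measure over the neighbours of row i, leaving out row i itself."""
    others = neighbours[neighbours != i]  # drop only the row's own index
    return measures[others].mean() if len(others) else np.nan


df['average_excluding_self'] = [
    mean_without_self(i, neighbours) for i, neighbours in enumerate(is_within)
]
print(df)
```

Only the row's own index is removed, so a different row with identical coordinates would still be counted as a neighbour.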