For a dataframe containing coordinate columns (e.g. 'x', 'y'), I would like to check whether the associated value 'val' deviates from the mean of 'val' in the local neighbourhood (all points whose distance to the row's coordinates is less than radius). I found the following approach, which is often used (e.g. here or here): build a KDTree and query the local mean for each row. However, I'm wondering whether there is a better solution that avoids iterating over the dataframe and therefore runs faster.
import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree
xy = np.mgrid[0:10,0:10]
df = pd.DataFrame({'x':xy[0].ravel(), 'y':xy[1].ravel(), 'val':np.random.rand(100)})
tree = KDTree(df[['x', 'y']].values, metric='euclidean')
radius = 5
for i, row in df.iterrows():
    coords = row[['x', 'y']].values.reshape(1, -1)
    idx = tree.query_radius(coords, r=radius)[0]
    df.loc[i, 'outlier'] = np.abs(row['val'] - df.iloc[idx]['val'].mean()) > df.iloc[idx]['val'].std()
df = df[df["outlier"] == False] #select df without outlier
There might be a way to avoid looping altogether that I haven't figured out yet, but an easy improvement is to collect the values you need into Series first and then compute the outlier flag with a single vectorized comparison, instead of writing to df.loc row by row. In my tests this reduced execution time by around 40% on average.
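# one (1, 2) coordinate array per row
coords = df[['x','y']].apply(lambda row: row.values.reshape(1,-1),axis=1)
# positional indices of all neighbours within the radius, per row
idx = coords.apply(lambda x: tree.query_radius(x,r=radius)[0])
# local mean and standard deviation of 'val' per neighbourhood
# (assigned with df[...] so they become real columns, not attributes)
df['means'] = idx.apply(lambda x: df['val'].iloc[x].mean())
df['stds'] = idx.apply(lambda x: df['val'].iloc[x].std())
df['outlier'] = np.abs(df['val'] - df['means']) > df['stds']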
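Going one step further, query_radius also accepts all points at once, so the per-row queries can be replaced by a single batched call. The following is only a sketch under some assumptions: the names points, neighbours, local_mean and local_std are made up for illustration, the random seed is fixed purely for reproducibility, and the per-neighbourhood sums are computed with np.add.reduceat rather than pandas. It has not been timed here, so whether it beats the versions above on your data is something to measure.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch reproducible
xy = np.mgrid[0:10, 0:10]
df = pd.DataFrame({'x': xy[0].ravel(), 'y': xy[1].ravel(), 'val': rng.random(100)})

radius = 5
points = df[['x', 'y']].to_numpy()
tree = KDTree(points, metric='euclidean')

# One batched query: an object array with one index array per row
neighbours = tree.query_radius(points, r=radius)

# Flatten all neighbourhoods into one long array and reduce per segment
vals = df['val'].to_numpy()
counts = np.array([len(n) for n in neighbours])          # each point is its own neighbour, so counts >= 1
flat = np.concatenate(list(neighbours))
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))   # start offset of each neighbourhood in flat

sums = np.add.reduceat(vals[flat], starts)
sq_sums = np.add.reduceat(vals[flat] ** 2, starts)

local_mean = sums / counts
# sample variance (ddof=1) to match pandas' Series.std();
# a point with no other neighbour would give 0/0 = NaN here
local_var = (sq_sums - counts * local_mean ** 2) / (counts - 1)
local_std = np.sqrt(np.clip(local_var, 0, None))

df['outlier'] = np.abs(vals - local_mean) > local_std
df_clean = df[~df['outlier']]  # rows that are not flagged as outliers
```

Because 'outlier' is assigned from a boolean array, the column has a proper bool dtype and can be negated with ~ directly. The sum-of-squares formula for the variance is fine for values of moderate size such as these, but it can lose precision when the values are very large compared with their spread.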