python pandas vectorization kdtree

Avoiding iterrows when querying for local outliers


For a dataframe containing coordinate columns (e.g. 'x', 'y'), I would like to check whether the associated value 'val' deviates from the mean of 'val' in the local neighbourhood (all points whose distance to the coordinates is less than a radius). I found the following commonly used approach: build a KDTree and query the local mean for each row. However, I'm wondering whether there is a better solution that avoids iterating over the dataframe and therefore runs faster.

import pandas as pd
import numpy as np
from sklearn.neighbors import KDTree

xy = np.mgrid[0:10,0:10]
df = pd.DataFrame({'x':xy[0].ravel(), 'y':xy[1].ravel(), 'val':np.random.rand(100)})

tree = KDTree(df[['x', 'y']].values, metric='euclidean')

radius = 5
for i, row in df.iterrows():
    coords = row[['x', 'y']].values.reshape(1, -1)
    idx = tree.query_radius(coords, r=radius)[0]
    df.loc[i, 'outlier'] = np.abs(row['val'] - df.iloc[idx]['val'].mean()) > df.iloc[idx]['val'].std()
df = df[df["outlier"] == False]  # keep only the rows not flagged as outliers

Solution

  • There might be a way to avoid looping altogether that I haven't figured out yet, but an easy solution is to place the values you need into arrays and then perform vectorized operations on those arrays. In my tests this reduced execution time by about 40% on average.

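    # Build one (1, 2) query array per row
    coords = df[['x','y']].apply(lambda row: row.values.reshape(1,-1), axis=1)
    # Positional indices of all neighbours within `radius` of each row
    idx = coords.apply(lambda x: tree.query_radius(x, r=radius)[0])
    # Assign with df[...] so that real columns are created
    # (attribute assignment such as df.means = ... does not add a column)
    df['means'] = idx.apply(lambda x: df['val'].iloc[x].mean())
    df['stds'] = idx.apply(lambda x: df['val'].iloc[x].std())
    df['outlier'] = np.abs(df['val'] - df['means']) > df['stds']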
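As a possible next step, the remaining per-row apply calls can probably be avoided too. The sketch below is one way to do it, not a tested benchmark: it calls query_radius once for all points, flattens the neighbour lists, and uses np.bincount to get per-row sums. The names flat, owner, local_mean and local_std are made up for this illustration, and the fixed random seed is only there for reproducibility. It assumes every point has at least two neighbours within the radius (itself included), because the sample standard deviation (ddof=1, the pandas default) is undefined otherwise.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree

# Same toy data as in the question, with a fixed seed (an assumption for reproducibility)
rng = np.random.default_rng(0)
xy = np.mgrid[0:10, 0:10]
df = pd.DataFrame({"x": xy[0].ravel(), "y": xy[1].ravel(), "val": rng.random(100)})

radius = 5
points = df[["x", "y"]].to_numpy()
vals = df["val"].to_numpy()

tree = KDTree(points, metric="euclidean")
# One call for all points: neighbours[i] is an array of positional indices near row i
neighbours = tree.query_radius(points, r=radius)

# Flatten the ragged neighbour lists and remember which row each entry belongs to
counts = np.array([len(ix) for ix in neighbours])
flat = np.concatenate(list(neighbours))
owner = np.repeat(np.arange(len(points)), counts)

# Per-row mean of the neighbouring values
sums = np.bincount(owner, weights=vals[flat], minlength=len(points))
local_mean = sums / counts

# Per-row sample standard deviation (ddof=1, matching pandas' Series.std default)
sq_dev = (vals[flat] - local_mean[owner]) ** 2
local_std = np.sqrt(np.bincount(owner, weights=sq_dev, minlength=len(points)) / (counts - 1))

df["outlier"] = np.abs(vals - local_mean) > local_std
inliers = df[~df["outlier"]]  # rows not flagged as outliers
```

The only Python-level loop left is the list comprehension that measures the neighbour-list lengths. Whether this is actually faster than the apply version will depend on the number of points and the radius, so it is worth timing on your own data.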