Tags: python, pandas, dataframe, outliers, anomaly-detection

Use dataframe column to subset entire dataframe and apply function row-wise (anomaly detection)


Suppose I have a table like this

code     neighbours                 t_min
abr007   abr018, abr030, abr032     16.2
abr018   abr007, abr030, laz246     20.9
abr030   abr007, abr032, cmp015     21.9
...      ...                        ...

It contains minimum-temperature observations (t_min) for a set of weather stations, each identified by a unique id given by code.

In order to find outliers I added a column with the neighbouring stations (neighbours), so that I can compare each station's value with the distribution given by its neighbours' observations: if it falls outside the [1st, 99th] percentile interval I flag it as an anomaly.

I've achieved this pretty easily by looping over the rows, subsetting the original dataframe with df[df.code.isin(row.neighbours)], computing the percentiles over this subset and comparing them with the station's own observation. However, this is relatively slow (about 8 s for ~3000 rows) and I have the feeling that I could get rid of the loop and vectorize the operation by grouping the original dataframe by code.

I've been thinking about this for a few days now and couldn't come up with an easy way to wrap the same operation in a function that performs well when grouping... Is that possible?

---- Details ----

Here is a snippet of the code that I currently use

# qn1 / qn2 are the percentile limits from the [1st, 99th] interval above;
# 'neighbours' is assumed to hold a list of station codes per row.
qn1, qn2 = 0.01, 0.99
df['anomaly'] = False  # start with no station flagged

for i, row in df.iterrows():
    # Grab all the neighbours and take care of
    # - excluding the station itself
    # - excluding any station that has already been flagged as anomalous
    subset = df[(df.code.isin(row.neighbours)) &
                (df.code != row.code) &
                (~df['anomaly'])]
    q1, q99 = subset['t_min'].quantile([qn1, qn2]).values
    if ((df.loc[df.code == row.code, 't_min'].item() < q1) or
            (df.loc[df.code == row.code, 't_min'].item() > q99)):
        df.loc[df.code == row.code, 'anomaly'] = True

Neighbours are found using a BallTree algorithm (that part is actually pretty fast!).
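
For context, a neighbour lookup of this kind could be built with scikit-learn's BallTree; a minimal sketch, where the lat/lon column names and the value of k are assumptions rather than part of the original setup:

import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical columns: 'lat' and 'lon' in degrees, plus the 'code' column above.
coords = np.radians(df[['lat', 'lon']].to_numpy())
tree = BallTree(coords, metric='haversine')

# Query the k nearest stations; the first hit is the station itself, so drop it.
k = 10  # assumed neighbourhood size
_, idx = tree.query(coords, k=k + 1)
codes = df['code'].to_numpy()
df['neighbours'] = [list(codes[row[1:]]) for row in idx]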


Solution

  • I've found a better way by creating ad-hoc arrays. If you consider

    • variable = df['t_min'].values
    • indices holds, for every station, the positional indices of its neighbours (one way to build it is sketched just below)
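
    If the neighbours column already holds lists of station codes (as in the table above), the positional indices can be recovered from it directly; a minimal sketch, where code_to_pos is just an illustrative helper name:

    # Map each station code to its row position, then look up every neighbour code.
    code_to_pos = {c: i for i, c in enumerate(df['code'])}
    indices = [np.array([code_to_pos[c] for c in neigh], dtype=int)
               for neigh in df['neighbours']]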

    Then I just compute the quantiles beforehand

    var_quantiles = [np.nanquantile(variable[ind], [qn1, qn2])
                     if len(ind) > 5 else np.array([-np.inf, np.inf])
                     for ind in indices]
    

    Then I construct a pd.Interval column that holds the limits given by the quantiles

    # thresh is an extra tolerance added on both sides of the quantile limits
    df['interval'] = pd.IntervalIndex.from_arrays(np.array(var_quantiles).T[0] - thresh,
                                                  np.array(var_quantiles).T[1] + thresh,
                                                  closed='both')
    

    Then I can just check each station individually and flag the ones whose value falls outside the interval

    # var holds the name of the value column (here 't_min') and
    # anomaly_column_name the name of the boolean flag column.
    df[anomaly_column_name] = df.apply(
        lambda x: x[var] not in x['interval'] and not np.isnan(x[var]),
        axis=1)
    

    This takes less than a second instead of 12 :)
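
    For comparison, an equivalent fully array-based check is also possible, skipping the Interval column entirely; a sketch reusing var_quantiles and thresh from above:

    # Lower/upper limits per station, widened by the same tolerance as above.
    bounds = np.array(var_quantiles)
    lower, upper = bounds[:, 0] - thresh, bounds[:, 1] + thresh
    values = df['t_min'].values
    df[anomaly_column_name] = ~np.isnan(values) & ((values < lower) | (values > upper))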