Tags: python, pandas, dataframe, outliers, anomaly-detection

Use dataframe column to subset entire dataframe and apply function row-wise (anomaly detection)


Suppose I have a table like this

code     neighbours                 t_min
abr007   abr018, abr030, abr032     16.2
abr018   abr007, abr030, laz246     20.9
abr030   abr007, abr032, cmp015     21.9
...      ...                        ...

It contains minimum-temperature observations (t_min) for a set of weather stations, each identified by a unique id given by code.

In order to find outliers I added a column with the neighbouring stations (neighbours), so that I can compare each station's value with the distribution given by its neighbours' observations: if it falls outside the [1st, 99th] percentile interval I flag it as an anomaly.

I've achieved this pretty easily by looping over the rows, subsetting the original dataframe with df[df.code.isin(row.neighbours)], computing the percentiles over this subset and comparing them with the station's own observation. However, this is relatively slow (about 8 s for ~3000 rows) and I have the feeling that I could get rid of the loop and vectorize the operation by grouping the original dataframe by code.

I've been thinking about this for a few days now and couldn't come up with an easy way to wrap the same operation in a function that performs well when grouping... Is that possible?

---- Details ----

Here is a snippet of the code that I currently use

# qn1 / qn2 are the percentile limits from the [1st, 99th] interval above;
# 'neighbours' is assumed to hold a list of station codes per row.
qn1, qn2 = 0.01, 0.99
df['anomaly'] = False  # start with no station flagged

for i, row in df.iterrows():
    # Grab all the neighbours and take care of
    # - excluding the station itself
    # - excluding any station that has already been flagged as anomalous
    subset = df[(df.code.isin(row.neighbours)) &
                (df.code != row.code) &
                (~df['anomaly'])]
    q1, q99 = subset['t_min'].quantile([qn1, qn2]).values
    if ((df.loc[df.code == row.code, 't_min'].item() < q1) or
            (df.loc[df.code == row.code, 't_min'].item() > q99)):
        df.loc[df.code == row.code, 'anomaly'] = True

Neighbours are found using a BallTree algorithm (that part is actually pretty fast!).
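
For context, a neighbour lookup of this kind could be built with scikit-learn's BallTree; a minimal sketch, where the lat/lon column names and the value of k are assumptions rather than part of the original setup:

import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical columns: 'lat' and 'lon' in degrees, plus the 'code' column above.
coords = np.radians(df[['lat', 'lon']].to_numpy())
tree = BallTree(coords, metric='haversine')

# Query the k nearest stations; the first hit is the station itself, so drop it.
k = 10  # assumed neighbourhood size
_, idx = tree.query(coords, k=k + 1)
codes = df['code'].to_numpy()
df['neighbours'] = [list(codes[row[1:]]) for row in idx]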


Solution

  • I've found a better way by creating ad-hoc arrays. If you consider

    • variable = df['t_min'].values
    • indices holds, for every station, the positional indices of its neighbours (one way to build it is sketched just below)
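
    If the neighbours column already holds lists of station codes (as in the table above), the positional indices can be recovered from it directly; a minimal sketch, where code_to_pos is just an illustrative helper name:

    # Map each station code to its row position, then look up every neighbour code.
    code_to_pos = {c: i for i, c in enumerate(df['code'])}
    indices = [np.array([code_to_pos[c] for c in neigh], dtype=int)
               for neigh in df['neighbours']]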

    Then I just compute the quantiles beforehand

    var_quantiles = [np.nanquantile(variable[ind], [qn1, qn2])
                     if len(ind) > 5 else np.array([-np.inf, np.inf])
                     for ind in indices]
    

    Then I construct a pd.Interval column that holds the limits given by the quantiles

    # thresh is an extra tolerance added on both sides of the quantile limits
    df['interval'] = pd.IntervalIndex.from_arrays(np.array(var_quantiles).T[0] - thresh,
                                                  np.array(var_quantiles).T[1] + thresh,
                                                  closed='both')
    

    Then I can just check each station individually and flag the ones whose value falls outside the interval

    # var holds the name of the value column (here 't_min') and
    # anomaly_column_name the name of the boolean flag column.
    df[anomaly_column_name] = df.apply(
        lambda x: x[var] not in x['interval'] and not np.isnan(x[var]),
        axis=1)
    

    This takes less than a second instead of 12 :)
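
    For comparison, an equivalent fully array-based check is also possible, skipping the Interval column entirely; a sketch reusing var_quantiles and thresh from above:

    # Lower/upper limits per station, widened by the same tolerance as above.
    bounds = np.array(var_quantiles)
    lower, upper = bounds[:, 0] - thresh, bounds[:, 1] + thresh
    values = df['t_min'].values
    df[anomaly_column_name] = ~np.isnan(values) & ((values < lower) | (values > upper))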