Suppose I have a table like this:

| code   | neighbors              | t_min |
|--------|------------------------|-------|
| abr007 | abr018, abr030, abr032 | 16.2  |
| abr018 | abr007, abr030, laz246 | 20.9  |
| abr030 | abr007, abr032, cmp015 | 21.9  |
| ...    | ...                    | ...   |
It contains observations of temperature for some weather stations, each having a unique ID given by `code`.
In order to find outliers I added a column with the neighbouring stations (`neighbors`), so as to compare each station's value with the distribution given by its neighbours' observations: if it falls outside the [1st, 99th] percentile interval, I flag it as an anomaly.
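For concreteness, the table above corresponds to a dataframe along these lines (a minimal sketch; the neighbour column is called `neighbors` to match the code below, and the `anomaly` flag starts out all `False`):

```python
import pandas as pd

df = pd.DataFrame({
    'code': ['abr007', 'abr018', 'abr030'],
    'neighbors': [['abr018', 'abr030', 'abr032'],
                  ['abr007', 'abr030', 'laz246'],
                  ['abr007', 'abr032', 'cmp015']],
    't_min': [16.2, 20.9, 21.9],
})
df['anomaly'] = False  # nothing flagged yet
```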
I've achieved this pretty easily by looping over the rows, subsetting the original dataframe with `df[df.code.isin(row.neighbors)]`, computing the percentiles over this subset and comparing them with the station's observation. However, this is relatively slow (about 8 s for ~3000 rows), and I have the feeling that I could get rid of the loop and vectorize the operation by grouping the original dataframe over `code`.
I've been thinking about this for a few days now and couldn't come up with an easy solution: is it possible to wrap the same operation into a function that performs well when grouping?
---- Details ----
Here is a snippet of the code that I currently use:
```python
qn1, qn2 = 0.01, 0.99  # 1st and 99th percentiles

for i, row in df.iterrows():
    # Grab all the neighbours, taking care of
    # - excluding the station itself
    # - excluding any station that has already been flagged as anomalous
    subset = df[(df.code.isin(row.neighbors)) &
                (df.code != row.code) &
                (~df['anomaly'])]
    q1, q99 = subset['t_min'].quantile([qn1, qn2]).values
    if ((df.loc[df.code == row.code, 't_min'].item() < q1) or
            (df.loc[df.code == row.code, 't_min'].item() > q99)):
        df.loc[df.code == row.code, 'anomaly'] = True
```
Neighbours are found using a `BallTree` (that part is actually pretty fast!).
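In case it is useful, here is a minimal sketch of that lookup with scikit-learn's `BallTree`, assuming hypothetical `lat`/`lon` columns (in degrees) and a hypothetical search radius; this is also one way to produce the `indices` arrays used below:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0
radius_km = 50.0  # hypothetical search radius

# The haversine metric expects (lat, lon) in radians
coords = np.deg2rad(df[['lat', 'lon']].to_numpy())
tree = BallTree(coords, metric='haversine')

# Positional indices of every station within radius_km of each station;
# query_radius includes the query point itself, so drop it afterwards
indices = tree.query_radius(coords, r=radius_km / EARTH_RADIUS_KM)
indices = [ind[ind != i] for i, ind in enumerate(indices)]
```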
I've found a better way by creating ad hoc arrays. If you consider

```python
variable = df['t_min'].values
```

where `indices` holds all the positional indices of the neighbours per station, then I just compute the quantiles beforehand:
```python
# Stations with 5 or fewer neighbours get unbounded limits
var_quantiles = [np.nanquantile(variable[ind], [qn1, qn2])
                 if len(ind) > 5 else np.array([-np.inf, np.inf])
                 for ind in indices]
```
then construct a `pd.Interval` column that holds the limits given by the quantiles (`thresh` is a fixed tolerance that widens the interval on both sides):
```python
df['interval'] = pd.IntervalIndex.from_arrays(np.array(var_quantiles).T[0] - thresh,
                                              np.array(var_quantiles).T[1] + thresh,
                                              closed='both')
```
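(For context: membership in a `pd.Interval` works with the `in` operator, e.g. `16.2 in pd.Interval(15.0, 17.0, closed='both')` evaluates to `True`; the check below relies on exactly that.)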
Then I can just check each station individually against its interval:
```python
df[anomaly_column_name] = df.apply(
    lambda x: x[var] not in x['interval'] and not np.isnan(x[var]),
    axis=1)
```
This takes less than a second instead of 12 :)
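As a final note, the `.apply` above can itself be replaced with a purely array-based check (a sketch reusing `variable`, `var_quantiles` and `thresh` from above, with the same column names as in my snippets), skipping the `Interval` column entirely:

```python
import numpy as np

bounds = np.asarray(var_quantiles)  # shape (n_stations, 2)
lo = bounds[:, 0] - thresh
hi = bounds[:, 1] + thresh

# Anomalous = value is defined and falls outside the closed interval [lo, hi]
df['anomaly'] = ~np.isnan(variable) & ((variable < lo) | (variable > hi))
```

Since everything stays in NumPy, this should be at least as fast as the `Interval` version.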