I am creating an interactive scatter plot which has thousands of data points, and I would like to dynamically find the outliers, in order to annotate only those points which are not too bunched together.
I am currently doing this in a slightly hacky way with the following query, where users can provide values for q_x, q_y and q_xy (say 0.998, 0.994 and 0.95):
outliers = df[(df['x'] > df['x'].quantile(q_x)) | (df['y'] > df['y'].quantile(q_y))
| ((df['x'] > df['x'].quantile(q_xy)) & (df['y'] > df['y'].quantile(q_xy)))]
This more or less achieves what I want, but the user has to modify three variables to get their desired selection, and even then the result is a bit uneven, as the three parts of the query focus on separate sections of the data.
Is there a better, more mathematically sound way to find outliers for a set of x, y points?
Many thanks.
I found an extremely useful article which answered this for me.
The code I've used:
from sklearn.ensemble import IsolationForest

outliers = 50  # or however many you want
l = len(df.index)
isf = IsolationForest(
    n_estimators=100,
    random_state=42,
    contamination=0.5 if outliers / l > 0.5 else outliers / l
)
preds = isf.fit_predict(df[['x', 'y']].to_numpy())
df["iso_forest_outliers"] = preds
Where outliers is the number of outliers I want to limit the result to. Outliers are listed as -1 in the column 'iso_forest_outliers' (all other points are 1). The value of contamination must be greater than 0 and at most 0.5, which is why there is the if else statement.
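For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same approach. The synthetic data, the name n_outliers and the to_annotate frame are my own assumptions for illustration and are not part of the code above. Because contamination is a proportion of the data set, the number of rows flagged should be close to the requested count but is not guaranteed to match it exactly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the real data (an assumption for this sketch):
# a dense cluster plus a handful of widely scattered points.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(1950, 2))
scattered = rng.uniform(low=-8.0, high=8.0, size=(50, 2))
df = pd.DataFrame(np.vstack([cluster, scattered]), columns=["x", "y"])

n_outliers = 50  # hypothetical target number of points to annotate (must be > 0)

# contamination is the expected proportion of outliers; cap it at 0.5,
# the largest value IsolationForest accepts.
contamination = min(0.5, n_outliers / len(df))

isf = IsolationForest(n_estimators=100, random_state=42, contamination=contamination)
df["iso_forest_outliers"] = isf.fit_predict(df[["x", "y"]].to_numpy())

# Rows labelled -1 are the outliers; these are the points to annotate.
to_annotate = df[df["iso_forest_outliers"] == -1]
print(len(to_annotate), "points flagged for annotation")
```

The to_annotate frame can then be passed to whatever annotation routine the plot uses, so a single value (the desired number of labels) replaces the three quantile settings from the question.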