Tags: python-3.x, pandas, math, scatter-plot, quantile

Finding outliers in a scatter plot / pandas dataframe?


I am creating an interactive scatter plot which has thousands of data points, and I would like to dynamically find the outliers, in order to annotate only those points which are not too bunched together.

I am currently doing this in a slightly hacky way with the following query, where users can provide values for q_x, q_y and q_xy (say 0.998, 0.994 and 0.95):

outliers = df[(df['x'] > df['x'].quantile(q_x)) | (df['y'] > df['y'].quantile(q_y))
          | ((df['x'] > df['x'].quantile(q_xy)) & (df['y'] > df['y'].quantile(q_xy)))]

This more or less achieves what I want, but the user has to adjust three variables to get their desired selection, and even then the result is a bit uneven, as the three parts of the query focus on separate sections of the data.
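To make the unevenness concrete, here is a minimal sketch that splits the query into its three masks and counts how many points each one selects. The helper name `quantile_masks` and the synthetic normally distributed data are assumptions for illustration; the real `df` is not shown in the question.

```python
import numpy as np
import pandas as pd


def quantile_masks(df, q_x, q_y, q_xy):
    """Return the three boolean masks that make up the quantile query."""
    far_x = df["x"] > df["x"].quantile(q_x)
    far_y = df["y"] > df["y"].quantile(q_y)
    far_both = (df["x"] > df["x"].quantile(q_xy)) & (df["y"] > df["y"].quantile(q_xy))
    return far_x, far_y, far_both


# Synthetic stand-in data (an assumption, not the asker's dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=5000), "y": rng.normal(size=5000)})

far_x, far_y, far_both = quantile_masks(demo, q_x=0.998, q_y=0.994, q_xy=0.95)
combined = far_x | far_y | far_both

# Each mask covers a different region, so the counts differ between the parts.
print(far_x.sum(), far_y.sum(), far_both.sum(), combined.sum())
```

Each threshold controls a different region of the plot, which is why tuning one of them does not give a uniform change in the overall selection.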

Is there a better, more mathematically sound way to find outliers for a set of x, y points?

Many thanks.


Solution

  • I found an extremely useful article which answered this for me.

    The code I've used:

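    from sklearn.ensemble import IsolationForest
    
    outliers = 50  # target number of outliers; adjust as needed
    n_rows = len(df.index)
    isf = IsolationForest(
        n_estimators=100,
        random_state=42,  # fixed seed so results are reproducible
        # contamination is the expected fraction of outliers, capped at 0.5
        contamination=0.5 if outliers / n_rows > 0.5 else outliers / n_rows
    )
    # fit_predict returns -1 for outliers and 1 for inliers
    preds = isf.fit_predict(df[['x', 'y']].to_numpy())
    df["iso_forest_outliers"] = preds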
    

    Here, outliers is the number of outliers I want to limit the result to (the count actually flagged is approximate). Outliers are marked as -1 in the 'iso_forest_outliers' column, and all other rows as 1. The value of contamination must be greater than 0 and at most 0.5, which is why the if/else caps it at 0.5.
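As a rough end-to-end sketch of the approach above, the following runs the same steps on synthetic data (an assumption; substitute your own dataframe) and pulls out the rows to annotate. Because the contamination threshold is a percentile of the scores, the number of flagged rows is only approximately the requested count; ranking by `score_samples` (lower means more anomalous) is one way to get exactly that many.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption, not the original dataset).
rng = np.random.default_rng(42)
demo = pd.DataFrame({"x": rng.normal(size=2000), "y": rng.normal(size=2000)})

n_outliers = 50
# Expected fraction of outliers, capped at scikit-learn's upper limit of 0.5.
contamination = min(0.5, n_outliers / len(demo))

isf = IsolationForest(n_estimators=100, random_state=42, contamination=contamination)
features = demo[["x", "y"]].to_numpy()
demo["iso_forest_outliers"] = isf.fit_predict(features)

# Rows flagged as outliers (-1); the count is roughly n_outliers.
to_annotate = demo[demo["iso_forest_outliers"] == -1]

# Alternative: rank by anomaly score and take exactly n_outliers rows.
demo["iso_score"] = isf.score_samples(features)
exact = demo.nsmallest(n_outliers, "iso_score")

print(len(to_annotate), len(exact))
```

The `to_annotate` (or `exact`) frame can then be passed to whatever annotation routine the plot uses.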