Tags: python, scikit-learn, outliers

Outlier detection with Local Outlier Factor (LOF)


I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to find a better method and came across the following code on scikit-learn.org.

Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select the outliers themselves if possible.

I have tried appending the predictions (y_pred) to the X dataframe, but that has not worked.

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(42)

# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]

# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]

Below is the code I tried.

X['outliers'] = y_pred

Solution

  • The first 200 rows of X are inliers and the last 20 are outliers. Calling fit_predict on X returns, for each row, either -1 (outlier) or 1 (inlier) in y_pred. So to get the predicted outliers, select the rows of X where y_pred is -1. The line below gives you the outliers in X.

    X_pred_outliers = [x for label, x in zip(y_pred, X.tolist()) if label == -1]
    

    I pair each value in y_pred with the corresponding row of X and keep the rows where the prediction is -1.

    However, eight of the 220 predictions are wrong. The errors are the -1 values in y_pred[:200] (inliers flagged as outliers) and the 1 values in y_pred[200:220] (outliers that were missed). Please keep these errors in mind as well.
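As a rough alternative to the list comprehension, here is a minimal sketch that selects the outliers with a NumPy boolean mask and then attaches the labels to a pandas DataFrame. It assumes pandas is installed; the column names x1, x2, outlier and lof_score are made up for illustration. The original attempt X['outliers'] = y_pred most likely failed because X is a NumPy array rather than a DataFrame, so it has no named columns to assign to.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(42)

# Same synthetic data as in the question: 200 inliers followed by 20 outliers
X = 0.3 * np.random.randn(100, 2)
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]

clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)  # -1 = predicted outlier, 1 = predicted inlier

# Boolean mask: True for every row that LOF flagged as an outlier
outlier_mask = y_pred == -1
X_pred_outliers = X[outlier_mask]

# Wrap the array in a DataFrame so that labels can be stored as a column
# (column names are hypothetical, chosen only for this example)
df = pd.DataFrame(X, columns=["x1", "x2"])
df["outlier"] = y_pred
df["lof_score"] = clf.negative_outlier_factor_  # lower means more abnormal
flagged = df[df["outlier"] == -1]

# Compare the predictions with the known layout of the synthetic data
false_alarms = int((y_pred[:200] == -1).sum())  # inliers flagged as outliers
missed = int((y_pred[200:] == 1).sum())         # outliers predicted as inliers

print(flagged)
print("false alarms:", false_alarms, "missed outliers:", missed)
```

With real claims data there is no known split into inliers and outliers, so the last step only applies to this synthetic example. The error counts printed here may also differ from the eight mentioned above depending on the scikit-learn version, because the default contamination setting has changed between releases.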