Tags: python, machine-learning, scikit-learn, data-analysis

Understand LocalOutlierFactor algorithm by example


So I have worked through the sklearn example of LocalOutlierFactor detection and tried to apply it to an example dataset I have. But somehow the result does not really make sense to me.

What I have implemented looks like this:

import numpy as np
import matplotlib.pyplot as plt
import pandas
from sklearn.neighbors import LocalOutlierFactor


# import file
url = ".../Python/outliner.csv"
names = ['R1', 'P1', 'T1', 'P2', 'Flag']
dataset = pandas.read_csv(url, names=names)    

array = dataset.values
X = array[:,0:2] 
rng = np.random.RandomState(42)


# fit the model
clf = LocalOutlierFactor(n_neighbors=50, algorithm='auto', leaf_size=30)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[500:]

# plot the level sets of the decision function
xx, yy = np.meshgrid(np.linspace(0, 1000, 50), np.linspace(0, 200, 50))
Z = clf._decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("Local Outlier Factor (LOF)")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

a = plt.scatter(X[:200, 0], X[:200, 1], c='white',
                edgecolor='k', s=20)
b = plt.scatter(X[200:, 0], X[200:, 1], c='red',
                edgecolor='k', s=20)
plt.axis('tight')
plt.xlim((0, 1000))
plt.ylim((0, 200))
plt.legend([a, b],
           ["normal observations",
            "abnormal observations"],
           loc="upper left")
plt.show()

I get something like this: [image: LOF outlier detection plot]

Can anybody tell me why the detection fails?

I have tried playing with the parameters and ranges, but not much changes in the outlier detection itself.

It would be great if somebody could point me in the right direction with this issue. Thanks!

Edit: added the input file.


Solution

  • I assume you followed this example. That example compares the actual observations (scatter plot) against the decision function learned from them (contour plot). Since the data there is made up (200 normal points followed by 20 outliers), it can simply select the outliers with X[200:] (index 200 onwards) and the normal points with X[:200] (indices 0-199).
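
    For reference, the data in that example is built roughly like this (a sketch, not the exact script; the blob centers and ranges are assumptions based on that example). It shows why positional slicing at index 200 separates the two groups there, but would not separate anything in a real dataset:

    ```python
    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.RandomState(42)

    # two Gaussian blobs of "normal" points, outliers appended at the end
    X_inliers = 0.3 * rng.randn(100, 2)
    X_inliers = np.r_[X_inliers + 2, X_inliers - 2]          # 200 normal points
    X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))   # 20 outliers
    X = np.r_[X_inliers, X_outliers]

    clf = LocalOutlierFactor(n_neighbors=20)
    y_pred = clf.fit_predict(X)   # 1 = inlier, -1 = outlier

    # positional slicing works here only because the outliers were appended last
    print((y_pred[200:] == -1).sum(), "of 20 appended outliers flagged")
    ```

    With your CSV there is no such known ordering, so slicing X at a fixed index just colors the points by row position, not by prediction.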

    So if you want to plot the prediction result (as a scatter plot) instead of the actual observations, do it like the code below: split X based on y_pred (1 = normal, -1 = outlier) and use the two subsets in the scatter plot:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas
    from sklearn.neighbors import LocalOutlierFactor
    
    # import file
    url = ".../Python/outliner.csv"
    names = ['R1', 'P1', 'T1', 'P2', 'Flag']
    dataset = pandas.read_csv(url, names=names)
    X = dataset.values[:, 0:2]
    
    # fit the model
    clf = LocalOutlierFactor(n_neighbors=50, algorithm='auto', leaf_size=30)
    y_pred = clf.fit_predict(X)
    
    # map results
    X_normals = X[y_pred == 1]
    X_outliers = X[y_pred == -1]
    
    # plot the level sets of the decision function
    xx, yy = np.meshgrid(np.linspace(0, 1000, 50), np.linspace(0, 200, 50))
    # NOTE: _decision_function is a private API; recent scikit-learn versions
    # removed it, and scoring new points requires a separate estimator fitted
    # with LocalOutlierFactor(novelty=True) and the public decision_function()
    Z = clf._decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.title("Local Outlier Factor (LOF)")
    plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
    
    a = plt.scatter(X_normals[:, 0], X_normals[:, 1], c='white', edgecolor='k', s=20)
    b = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red', edgecolor='k', s=20)
    plt.axis('tight')
    plt.xlim((0, 1000))
    plt.ylim((0, 200))
    plt.legend([a, b], ["normal predictions", "abnormal predictions"], loc="upper left")
    plt.show()
    

    As you can see, the scatter plot of normal data follows the contour plot:

    [image: prediction scatter plot over the LOF decision-function contours]
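
    If the Flag column in your CSV marks known outliers (an assumption; I don't know its actual encoding), you can also sanity-check the predictions against it using the same boolean-mask split. A minimal self-contained sketch with synthetic stand-in data:

    ```python
    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.RandomState(0)

    # synthetic stand-in for the CSV: two feature columns plus a known flag
    X = np.r_[rng.normal(500, 30, size=(200, 2)),      # dense "normal" cluster
              rng.uniform(0, 1000, size=(10, 2))]      # scattered outliers
    flag = np.r_[np.ones(200), -np.ones(10)]           # 1 = normal, -1 = outlier (assumed convention)

    clf = LocalOutlierFactor(n_neighbors=50, contamination=0.05)
    y_pred = clf.fit_predict(X)

    # split by prediction, not by row position
    X_normals = X[y_pred == 1]
    X_outliers = X[y_pred == -1]

    print("agreement with known flags:", (y_pred == flag).mean())
    ```

    The contamination parameter sets what fraction of points LOF labels as outliers, so it is worth tuning it to roughly match the outlier rate you expect in your data.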