
Isolation Forest


I'm currently working on identifying outliers in my data set using the IsolationForest method in Python, but don't completely understand the example on sklearn:

http://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py

Specifically, what is the graph actually showing us? The observations have already been defined as normal/outliers -- so I'm assuming the shade of the contour plot indicates whether that observation is indeed an outlier (e.g., observations with higher anomaly scores lie in darker shaded areas?).

Lastly, how is the following section of code actually being used (specifically the y_pred variables)?

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers) 

I'm guessing it was just provided for completeness in the event someone wants to print the output?

Thanks in advance for the help!


Solution

  • For each observation, `predict` tells whether or not (**+1 or -1**) it should be considered an outlier according to the fitted model.


    Simple Example Using Iris data

    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import IsolationForest
    
    rng = np.random.RandomState(42)
    data = load_iris()
    
    X = data.data
    y = data.target
    # random points drawn well outside the range of the iris features
    X_outliers = rng.uniform(low=-4, high=4, size=(X.shape[0], X.shape[1]))
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    clf = IsolationForest(random_state=0)
    clf.fit(X_train)
    
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    y_pred_outliers = clf.predict(X_outliers)
    
    print(y_pred_test)
    print(y_pred_outliers)
    

    Result:

    [-1 -1 -1 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1
      1 -1 -1  1 -1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1 -1  1]
    
    [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1]
    

    Interpretation:

    print(y_pred_test) returns a mix of 1 and -1: some samples of X_test are classified as normal and some as outliers.

    On the other hand, print(y_pred_outliers) returns only -1. This means that all the samples of X_outliers (150 in total, matching the iris data) are classified as outliers.
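    As for your first question: the shading of the contour plot in the sklearn example comes from `decision_function`, the continuous anomaly score that `predict` thresholds at zero (negative score means -1, i.e. outlier). More negative scores are more anomalous, and those regions are shaded darker. A minimal sketch with the same iris setup, showing that the random points score lower than the real data:

    ```python
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(42)
    X = load_iris().data
    X_outliers = rng.uniform(low=-4, high=4, size=X.shape)

    clf = IsolationForest(random_state=0).fit(X)

    # decision_function gives the continuous score behind the contour shading:
    # negative = anomalous, positive = normal; predict() thresholds it at 0
    scores_in = clf.decision_function(X)
    scores_out = clf.decision_function(X_outliers)

    print(scores_in.mean())   # iris points: scores mostly positive
    print(scores_out.mean())  # uniform noise: scores mostly negative
    ```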


    Using your code

    After running your code, just print y_pred_outliers:

    # fit the model
    clf = IsolationForest(max_samples=100, random_state=rng)
    clf.fit(X_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    y_pred_outliers = clf.predict(X_outliers) 
    
    print(y_pred_outliers)
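
    If you want to do more than print the labels, the ±1 predictions are easy to turn into a boolean mask for counting or filtering the flagged rows. A short sketch, reusing the iris setup from above:

    ```python
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(42)
    X = load_iris().data
    X_outliers = rng.uniform(low=-4, high=4, size=X.shape)

    clf = IsolationForest(random_state=0).fit(X)
    y_pred = clf.predict(X_outliers)

    # predict() returns +1 (inlier) or -1 (outlier); build a mask from it
    outlier_mask = y_pred == -1
    print("flagged as outliers:", outlier_mask.sum(), "of", len(X_outliers))

    # keep only the rows the model considers normal
    X_clean = X_outliers[~outlier_mask]
    ```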