
Isolation Forest


I'm currently working on identifying outliers in my data set using the IsolationForest method in Python, but don't completely understand the example on sklearn:

http://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py

Specifically, what is the graph actually showing us? The observations have already been defined as normal/outliers -- so I'm assuming the shade of the contour plot indicates whether that observation is indeed an outlier (e.g., observations with higher anomaly scores lie in darker shaded areas?).

Lastly, how is the following section of code actually being used (specifically the y_pred variables)?

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers) 

I'm guessing it was just provided for completeness in the event someone wants to print the output?

Thanks in advance for the help!


Solution

  • For each observation, `predict` tells whether or not (**+1 or -1**) it should be considered an outlier according to the fitted model.


    Simple Example Using Iris data

    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import IsolationForest
    
    rng = np.random.RandomState(42)
    data = load_iris()
    
    X = data.data
    y = data.target
    # random points drawn well outside the range of the iris features
    X_outliers = rng.uniform(low=-4, high=4, size=(X.shape[0], X.shape[1]))
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    clf = IsolationForest(random_state=0)
    clf.fit(X_train)
    
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    y_pred_outliers = clf.predict(X_outliers)
    
    print(y_pred_test)
    print(y_pred_outliers)
    

    Result:

    [-1 -1 -1 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1
      1 -1 -1  1 -1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1 -1  1]
    
    [-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
     -1 -1 -1 -1 -1 -1]
    

    Interpretation:

    print(y_pred_test) returns a mix of 1 and -1: some samples of X_test are classified as normal and some as outliers.

    On the other hand, print(y_pred_outliers) returns only -1. This means that all the samples of X_outliers (150 in total, matching the iris data) are classified as outliers.
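    As for your first question: the shading of the contour plot in the sklearn example comes from `decision_function`, the continuous anomaly score that `predict` thresholds at zero (negative score means -1, i.e. outlier). More negative scores are more anomalous, and those regions are shaded darker. A minimal sketch with the same iris setup, showing that the random points score lower than the real data:

    ```python
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(42)
    X = load_iris().data
    X_outliers = rng.uniform(low=-4, high=4, size=X.shape)

    clf = IsolationForest(random_state=0).fit(X)

    # decision_function gives the continuous score behind the contour shading:
    # negative = anomalous, positive = normal; predict() thresholds it at 0
    scores_in = clf.decision_function(X)
    scores_out = clf.decision_function(X_outliers)

    print(scores_in.mean())   # iris points: scores mostly positive
    print(scores_out.mean())  # uniform noise: scores mostly negative
    ```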


    Using your code

    After running your code, just print y_pred_outliers:

    # fit the model
    clf = IsolationForest(max_samples=100, random_state=rng)
    clf.fit(X_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    y_pred_outliers = clf.predict(X_outliers) 
    
    print(y_pred_outliers)
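
    If you want to do more than print the labels, the ±1 predictions are easy to turn into a boolean mask for counting or filtering the flagged rows. A short sketch, reusing the iris setup from above:

    ```python
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(42)
    X = load_iris().data
    X_outliers = rng.uniform(low=-4, high=4, size=X.shape)

    clf = IsolationForest(random_state=0).fit(X)
    y_pred = clf.predict(X_outliers)

    # predict() returns +1 (inlier) or -1 (outlier); build a mask from it
    outlier_mask = y_pred == -1
    print("flagged as outliers:", outlier_mask.sum(), "of", len(X_outliers))

    # keep only the rows the model considers normal
    X_clean = X_outliers[~outlier_mask]
    ```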