scikit-learn, outliers, anomaly-detection

sklearn: Anomaly detection using Isolation Forests


I have a training dataset which contains no outliers:

train_vectors.shape
(588649, 896)

I also have a separate set of test vectors (test_vectors), all of which are outliers.

Here is my attempt at doing the outlier detection:

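import numpy as np
from sklearn.ensemble import IsolationForest

# Each tree is built on a 1% subsample of the training vectors
clf = IsolationForest(max_samples=0.01)
clf.fit(train_vectors)
y_pred_train = clf.predict(train_vectors)  # 1 = inlier, -1 = outlier
print(len(y_pred_train))
print(np.count_nonzero(y_pred_train == 1))   # predicted inliers
print(np.count_nonzero(y_pred_train == -1))  # predicted outliers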

Output:
 588649
 529771
 58878

So the share of predicted outliers here is around 10%, which matches the default value of the contamination parameter for Isolation Forests in sklearn. Please note that there aren't any actual outliers in the training set.

Testing code and results:

y_pred_test = clf.predict(test_vectors)
print(len(y_pred_test))
print(np.count_nonzero(y_pred_test == 1))
print(np.count_nonzero(y_pred_test == -1))

Output:
 100
 83
 17

So it detects only 17 anomalies out of the 100. Can someone please tell me how to improve the performance? I am not at all sure why the algorithm requires the user to specify the contamination parameter. It is clear to me that it is used as a threshold, but how am I supposed to know the contamination level beforehand? Thank you!


Solution

  • IsolationForest works a bit differently from what you described :) The documentation describes the contamination parameter as:

    "The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function." (scikit-learn documentation)

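    This means that with contamination at 0.1 (the default before scikit-learn 0.22; newer releases default to "auto"), the threshold is chosen so that roughly 10% of the training samples are labelled as outliers, whether or not they really are. That explains the 10% you see on your clean training set. For the threshold to be meaningful, your training set should actually contain about 10% outliers. Ideally, your test set should contain about the same proportion of outliers as well, and it should not consist of outliers only.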

    train set and test set proportions
    ------------------------------------------------
    |  normal ~ 90%                  | outliers 10%|
    ------------------------------------------------
    

    Try to change your dataset proportions as described and try again with the code you posted!
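
As a rough illustration of the proportions above, here is a minimal sketch on synthetic data. The Gaussian "normal" points, the uniform "outlier" points, the 8 features and the helper make_mixed are all assumptions made up for this example, not your actual vectors. It fits an IsolationForest on a set with about 10% outliers and checks both how many training points get flagged and how many test outliers are caught.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

def make_mixed(n_normal, n_outliers):
    """Hypothetical helper: return (X, y) with y = 1 for normal points, -1 for outliers."""
    normal = rng.normal(0.0, 1.0, size=(n_normal, 8))         # assumed "normal" data
    outliers = rng.uniform(-8.0, 8.0, size=(n_outliers, 8))   # assumed "outlier" data
    X = np.vstack([normal, outliers])
    y = np.concatenate([np.ones(n_normal, dtype=int),
                        -np.ones(n_outliers, dtype=int)])
    return X, y

# Roughly 90% normal / 10% outliers in both sets
X_train, y_train = make_mixed(1800, 200)
X_test, y_test = make_mixed(180, 20)

# contamination matches the assumed share of outliers in the training set
clf = IsolationForest(contamination=0.1, random_state=0)
clf.fit(X_train)

# Share of training points labelled -1 (driven by the contamination threshold)
train_flagged = float(np.mean(clf.predict(X_train) == -1))

# Share of true test outliers that are labelled -1
y_pred_test = clf.predict(X_test)
outlier_recall = float(np.mean(y_pred_test[y_test == -1] == -1))

print(f"share of training points flagged: {train_flagged:.3f}")
print(f"share of test outliers detected:  {outlier_recall:.3f}")
```

The share of flagged training points should land close to the contamination value regardless of what the data looks like, which is why a clean training set still produces about 10% "outliers" with contamination=0.1.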

    Hope this helps, good luck!

    P.S. You can also try OneClassSVM, which is trained on normal instances only. The test set should still look like the one above, though, and not consist only of outliers.
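
A minimal sketch of the OneClassSVM route, again on synthetic data that is only an assumption for illustration (Gaussian normal points, uniform outliers, 8 features). The model is fitted on normal points only, and the values nu=0.05 and gamma="scale" are example choices, not tuned recommendations. Note that a kernel SVM may be slow on a training set as large as yours, so subsampling might be needed.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Training set: normal instances only (synthetic stand-in)
X_train = rng.normal(0.0, 1.0, size=(1000, 8))

# Test set: mostly normal points plus a few outliers
X_test_normal = rng.normal(0.0, 1.0, size=(90, 8))
X_test_outliers = rng.uniform(-8.0, 8.0, size=(10, 8))
X_test = np.vstack([X_test_normal, X_test_outliers])
y_test = np.concatenate([np.ones(90, dtype=int), -np.ones(10, dtype=int)])

# nu is an upper bound on the fraction of training points treated as errors
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
clf.fit(X_train)

y_pred = clf.predict(X_test)  # 1 = inlier, -1 = outlier

outlier_recall = float(np.mean(y_pred[y_test == -1] == -1))
false_alarm_rate = float(np.mean(y_pred[y_test == 1] == -1))

print(f"share of test outliers detected:     {outlier_recall:.3f}")
print(f"share of normal test points flagged: {false_alarm_rate:.3f}")
```

Here nu plays a role similar to contamination: it controls how much of the training data the model is allowed to leave outside the learned boundary.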