sklearn: Anomaly detection using Isolation Forests

I have a training dataset (train_vectors) that contains no outliers. Its shape is:

(588649, 896)

I also have a set of 100 test vectors (test_vectors), all of which are outliers.

Here is my attempt at doing the outlier detection:

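import numpy as np
from sklearn.ensemble import IsolationForest

# Each tree is built on a 1% subsample of the training rows
clf = IsolationForest(max_samples=0.01)
clf.fit(train_vectors)  # the model must be fitted before calling predict()
y_pred_train = clf.predict(train_vectors)
print(np.count_nonzero(y_pred_train == 1))   # predicted inliers
print(np.count_nonzero(y_pred_train == -1))  # predicted outliers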


So the fraction of training points flagged as outliers is around 10%, which matches the default contamination parameter used for Isolation Forests in sklearn. Please note that there aren't any outliers in the training set.
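To illustrate why roughly 10% of a clean training set gets flagged, here is a minimal sketch on synthetic data. The array X_clean and the two contamination values are assumptions made up for this illustration, not the dataset from the question. When contamination is given as a float, the threshold is placed so that about that fraction of the training points falls below it, whether or not those points are real outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical stand-in for train_vectors: Gaussian data with no injected outliers
X_clean = rng.normal(size=(2000, 8))

flagged_fraction = {}
for contamination in (0.01, 0.1):
    clf = IsolationForest(contamination=contamination, random_state=0)
    clf.fit(X_clean)
    # predict() returns -1 for points scored below the fitted threshold
    n_flagged = np.count_nonzero(clf.predict(X_clean) == -1)
    flagged_fraction[contamination] = n_flagged / len(X_clean)
    print(contamination, flagged_fraction[contamination])
```

In this sketch the flagged fraction follows the contamination value rather than the data, which suggests the 10% seen above says little about how many real outliers the training set contains.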

Testing code and results:

y_pred_test = clf.predict(test_vectors)
print(np.count_nonzero(y_pred_test == 1))
print(np.count_nonzero(y_pred_test == -1))


So it detects only 17 anomalies out of the 100. Can someone please tell me how to improve the performance? I am not at all sure why the algorithm requires the user to specify the contamination parameter. It is clear to me that it is used as a threshold, but how am I supposed to know the contamination level beforehand? Thank you!


  • IsolationForest works a bit differently than what you described :). The contamination is:

    "The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function." (from the scikit-learn documentation)

    This means that, with the default contamination, your training set is expected to contain about 10% outliers. Ideally, your test set should contain about the same proportion of outliers, and it should not consist of outliers only.

    train set and test set proportions
    |  normal ~ 90%                  | outliers 10%|

    Try to change your dataset proportions as described and try again with the code you posted!

    Hope this helps, good luck!

    P.S. You can also try OneClassSVM, which is trained on the normal instances only. The test set should still look like the one described above, though, and not consist of outliers only.
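As a rough sketch of the OneClassSVM suggestion, the snippet below trains on normal points only and evaluates on a test set with about 90% normal points and 10% outliers. All data here is synthetic and hypothetical, and the choices of nu=0.05, gamma="scale", and the outlier distribution are assumptions for illustration, not tuned values for the data in the question.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Hypothetical data: training set contains normal points only
X_train = rng.normal(0, 1, size=(1000, 5))

# Mixed test set: ~90% normal points, ~10% outliers far from the training cloud
X_test_normal = rng.normal(0, 1, size=(90, 5))
X_test_outliers = rng.uniform(6, 8, size=(10, 5))
X_test = np.vstack([X_test_normal, X_test_outliers])
y_test = np.r_[np.ones(90), -np.ones(10)]  # 1 = normal, -1 = outlier

# nu upper-bounds the fraction of training points treated as errors (assumed value)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_train)
y_pred = ocsvm.predict(X_test)

# Share of true outliers that were flagged, and share of normal points flagged by mistake
outlier_recall = np.mean(y_pred[y_test == -1] == -1)
false_alarm_rate = np.mean(y_pred[y_test == 1] == -1)
print(outlier_recall, false_alarm_rate)
```

Evaluating on a mixed test set like this gives both a detection rate and a false alarm rate, which a test set made only of outliers cannot provide.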