Tags: python, scikit-learn, random-forest, anomaly-detection, isolation-forest

How to use Isolation Forest in Python


I'm working on detecting outliers in my unlabeled dataset (data are not labeled as inliers/outliers) and I'm using Isolation Forest in Python (scikit-learn library).
I want to get the anomaly score of the data in my dataset and so I'm using the following code:

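from sklearn.ensemble import IsolationForest

# dataset: unlabeled feature matrix of shape (n_samples, n_features)
if_model = IsolationForest(max_samples=100)
if_model.fit(dataset)
anomaly_score = if_model.score_samples(dataset)  # lower score = more anomalous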

However, I have some questions:

  • Is the previous procedure correct, or should I split my dataset into two parts, fitting on one and computing the anomaly score on the other?
  • What is the purpose of the predict method? How should I use it?

Solution

    • To answer your first question, you do not need to split the dataset. Held-out test sets are used with supervised algorithms: when each row has an expected result, you compare the model's output with that result to evaluate how well the model performs. Those rows cannot also be used to fit the model, because the model might fit them well without fitting other data, and you would not know. Isolation Forest, however, is an unsupervised algorithm. You have no list of anomalous rows to compare its results against, so holding back data to verify that the model works serves no purpose.

    • To answer the second question, predict returns an array with a yes-or-no label for each row: 1 if the row is considered an inlier and -1 if it is considered an outlier. score_samples returns a number representing how anomalous each row is (the lower the score, the more anomalous the row) but does not tell you whether the row is classified as anomalous. See the sklearn documentation.
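
To make the difference between the two methods concrete, here is a minimal sketch that fits and scores the same data. The synthetic dataset and the random_state values are assumptions added for illustration and are not part of the original question. The sketch also uses decision_function, which is score_samples shifted by the fitted offset_ so that negative values correspond to rows that predict labels -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the unlabeled dataset (an assumption for illustration):
# 200 points around the origin plus 5 points far away from them.
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=8.0, high=10.0, size=(5, 2))
dataset = np.vstack([inliers, outliers])

# Fit and score on the same data, as in the question.
if_model = IsolationForest(max_samples=100, random_state=0)
if_model.fit(dataset)

# Continuous score per row: the lower the value, the more anomalous the row.
anomaly_score = if_model.score_samples(dataset)

# Hard label per row: 1 for inliers, -1 for outliers.
labels = if_model.predict(dataset)

# score_samples minus the fitted offset_; negative values are labeled -1.
decision = if_model.decision_function(dataset)

# Rows ordered from most to least anomalous.
most_anomalous_first = np.argsort(anomaly_score)
```

If you only need a ranking of rows, anomaly_score is enough. predict is useful when you want the model to apply a cutoff for you; that cutoff is governed by the contamination parameter, which defaults to 'auto'.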