Tags: python, algorithm, machine-learning, scikit-learn, sampling

Can you use the isolation forest algorithm on large sample sizes?


I've been using the scikit-learn sklearn.ensemble.IsolationForest implementation of the isolation forest to detect anomalies in my datasets, which range from hundreds of rows to millions of rows. It seems to be working well, and I've overridden max_samples with a very large integer to handle some of my larger datasets (essentially not using sub-sampling). I noticed that the original paper states that larger sample sizes create a risk of swamping and masking.

Is it okay to use the isolation forest on large sample sizes if it seems to be working? When I trained with a smaller max_samples, testing produced too many anomalies. My data has really started to grow, and I'm wondering whether a different anomaly detection algorithm would be better suited to such a large sample size.
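For reference, here is a minimal sketch of the kind of setup I described (synthetic data and illustrative parameter values, not my actual pipeline):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative stand-in for one of the larger datasets
rng = np.random.RandomState(42)
X = rng.normal(size=(100_000, 10))

# Overriding max_samples with the dataset size means every iTree
# is built on all rows, i.e. sub-sampling is effectively disabled
clf = IsolationForest(
    n_estimators=100,
    max_samples=len(X),
    random_state=42,
)
clf.fit(X)
labels = clf.predict(X)  # -1 = anomaly, 1 = normal
```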


Solution

  • Citing the original paper:

    The isolation characteristic of iTrees enables them to build partial models and exploit sub-sampling to an extent that is not feasible in existing methods. Since a large part of an iTree that isolates normal points is not needed for anomaly detection, it does not need to be constructed. A small sample size produces better iTrees because the swamping and masking effects are reduced.

    From your question, I have a feeling that you are confusing the size of the dataset with the size of the sample you take from it to construct each iTree. The isolation forest can handle very large datasets; in fact, it works better when it sub-samples them.
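    To make the distinction concrete, here is a minimal sketch (synthetic data and illustrative values): the dataset stays large, while scikit-learn's default max_samples='auto' grows each iTree from at most 256 rows.

    ```python
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    X = rng.normal(size=(500_000, 5))  # a large dataset

    # The default max_samples='auto' draws min(256, n_samples) rows per
    # tree, so each of the 100 iTrees here is built from only 256 rows
    clf = IsolationForest(n_estimators=100, max_samples='auto',
                          random_state=0)
    clf.fit(X)

    print(clf.max_samples_)  # 256 -- the per-tree sub-sample size used
    ```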

    The original paper discusses this in Section 3:

    The data set has two anomaly clusters located close to one large cluster of normal points at the centre. There are interfering normal points surrounding the anomaly clusters, and the anomaly clusters are denser than normal points in this sample of 4096 instances. Figure 4(b) shows a sub-sample of 128 instances of the original data. The anomaly clusters are clearly identifiable in the sub-sample. Those normal instances surrounding the two anomaly clusters have been cleared out, and the size of anomaly clusters becomes smaller which makes them easier to identify. When using the entire sample, iForest reports an AUC of 0.67. When using a sub-sampling size of 128, iForest achieves an AUC of 0.91.

    [Figure 4 from the paper: (a) the original sample of 4096 instances; (b) the sub-sample of 128 instances.]
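    You can reproduce the effect with a rough synthetic stand-in for that experiment (this is not the paper's data, so the AUC values will differ, but the sub-sampled forest should still score higher):

    ```python
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.metrics import roc_auc_score

    # One large normal cluster with two small, dense anomaly clusters
    rng = np.random.RandomState(0)
    normal = rng.normal(0.0, 1.0, size=(4000, 2))
    anomalies = np.vstack([
        rng.normal([3.0, 3.0], 0.1, size=(48, 2)),
        rng.normal([-3.0, 3.0], 0.1, size=(48, 2)),
    ])
    X = np.vstack([normal, anomalies])
    y = np.r_[np.zeros(len(normal)), np.ones(len(anomalies))]  # 1 = anomaly

    for max_samples in (len(X), 128):  # full data vs. 128-instance sample
        clf = IsolationForest(max_samples=max_samples, random_state=0).fit(X)
        scores = -clf.score_samples(X)  # higher = more anomalous
        print(f"max_samples={max_samples}: AUC={roc_auc_score(y, scores):.2f}")
    ```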

    Isolation forest is not a perfect algorithm and needs parameter tuning for your specific data. It might even perform poorly on some datasets. If you wish to consider other methods, Local Outlier Factor is also included in scikit-learn. You may also combine several methods into an ensemble, as sketched below.
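    As a sketch of the ensemble idea, one simple (and purely illustrative) way to combine the two detectors is to min-max normalize their scores and average them:

    ```python
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    def minmax(scores):
        # Rescale to [0, 1] so scores from different methods are comparable
        return (scores - scores.min()) / (scores.max() - scores.min())

    rng = np.random.RandomState(0)
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(1000, 2)),  # normal points
        rng.uniform(-6.0, 6.0, size=(30, 2)),  # scattered outliers
    ])

    # Isolation forest: negate score_samples so higher = more anomalous
    iso_scores = -IsolationForest(random_state=0).fit(X).score_samples(X)

    # Local Outlier Factor: negative_outlier_factor_ is lower for outliers
    lof = LocalOutlierFactor(n_neighbors=20).fit(X)
    lof_scores = -lof.negative_outlier_factor_

    combined = (minmax(iso_scores) + minmax(lof_scores)) / 2
    top10 = np.argsort(combined)[-10:]  # indices of the 10 most anomalous
    ```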

    The scikit-learn documentation also includes a nice side-by-side comparison of the different outlier detection methods.