python, scikit-learn, outliers, anomaly-detection

Is it mandatory to set contamination value for isolation forest in python?


I'm building a model to identify anomalies in my dataset. After a lot of research, I concluded that isolation forest is the best option so far. My dataset has no labels (it contains only explanatory variables), and I don't know how to set the contamination parameter of the isolation forest. Most of the articles I found already have an output variable (labeled as anomaly), use it to calculate the outlier ratio, and then set that as the contamination value.

Is it mandatory to set it? The default value for contamination is 0.1. Is it okay to ignore it? If I don't give it a value, does that affect the model results?

from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.1, n_estimators=1000)

Solution

  • No, it is not mandatory to set the contamination value. By default it is set to "auto".

    contamination: 'auto' or float, default='auto'. The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples.

    Reference in documentation

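    You can therefore leave it unset, but the choice does affect the model results: the predict method labels samples by comparing their scores against a threshold, and that threshold is determined by the contamination value. With "auto", the threshold (the fitted offset_ attribute) is fixed at -0.5; with a float, it is chosen so that roughly that proportion of the training samples is flagged as outliers.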

    The predict method makes use of a threshold on the raw scoring function computed by the estimator. This scoring function is accessible through the score_samples method, while the threshold can be controlled by the contamination parameter.

    Reference in documentation
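
As a rough illustration of the point above, here is a minimal sketch comparing contamination="auto" with contamination=0.1. The synthetic data, the variable names, and the random_state values are assumptions made for the example, not part of the question. With the same data and seed, the raw scores from score_samples should match between the two models; only the threshold, and therefore the output of predict, differs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical unlabeled data: 200 clustered points plus 10 scattered ones.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.uniform(-6, 6, size=(10, 2)),
])

# Same seed for both models, so the trees (and raw scores) are identical.
auto_model = IsolationForest(contamination="auto", random_state=0).fit(X)
fixed_model = IsolationForest(contamination=0.1, random_state=0).fit(X)

# contamination does not change the raw anomaly scores...
same_scores = np.allclose(auto_model.score_samples(X),
                          fixed_model.score_samples(X))

# ...it only changes the threshold (offset_) that predict applies to them.
n_flagged_auto = int((auto_model.predict(X) == -1).sum())
n_flagged_fixed = int((fixed_model.predict(X) == -1).sum())

print("raw scores identical:", same_scores)
print("offset_ (auto):", auto_model.offset_)
print("offset_ (0.1): ", fixed_model.offset_)
print("flagged (auto):", n_flagged_auto)
print("flagged (0.1): ", n_flagged_fixed)
```

If you have no estimate of the outlier ratio, one option is to keep "auto" and inspect the score_samples output directly, choosing a cutoff that makes sense for your data rather than relying on predict alone.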