Tags: python, h2o, anomaly-detection, pycaret

Different results on anomaly detection between PyCaret and H2O


I'm working on detecting anomalies in the following data:
[Image: processed signal with anomalous points marked in red boxes]

It comes from the processed signal of a hydraulic system; from it I know that the dots in the red boxes are anomalies that happen when the system fails.

I'm using the first 3,000 records to train a model, in both PyCaret and H2O. These 3,000 records cover 5 cycles of data, as shown in the image below:

[Image: the first 3,000 records, covering 5 cycles]

To train the model in PyCaret I'm using the following code:

from pycaret.anomaly import *
import pandas as pd

# df holds the processed signal; train on the first 3,000 records (5 cycles)
exp_ano101 = setup(df[["Pressure_median_mw_2500_ac"]][0:3000], normalize=True,
                   session_id=123)

# Isolation Forest detector
iforest = create_model('iforest')

# Score the whole series, not just the training slice
unseen_predictions = predict_model(iforest, data=df[["Pressure_median_mw_2500_ac"]])
unseen_predictions = unseen_predictions.reset_index()

The results I get from pycaret are pretty good:

[Image: PyCaret Isolation Forest predictions]

And with a bit of post-processing I can get the following, which is quite close to the ideal:

[Image: post-processed PyCaret results]

On the other hand, using H2O, with the following code:

import pandas as pd
import h2o
from h2o.estimators import H2OIsolationForestEstimator

h2o.init()

# Train on the first 3,000 records, converted to an H2OFrame
hf = h2o.H2OFrame(df[["Pressure_median_mw_2500_ac"]][0:3000])
ifr = H2OIsolationForestEstimator()
ifr.train(x="Pressure_median_mw_2500_ac", training_frame=hf)

# Score the whole series; predictions include the mean isolation path length
preds = ifr.predict(h2o.H2OFrame(df[["Pressure_median_mw_2500_ac"]]))
df["mean_length"] = preds["mean_length"].as_data_frame()["mean_length"].values

# Anomalies isolate quickly, i.e. they have a *short* mean path length
th = df["mean_length"][0:3000].quantile(0.05)
df["anomaly"] = df["mean_length"].apply(lambda x: "1" if x < th else "0")

I get this:

[Image: H2O Isolation Forest predictions]

Which is a huge difference, since it does not detect this block as anomalous:

[Image: anomalous block missed by the H2O model]

My question is: how can I get results similar to the ones I get from PyCaret, given that I'm using the same algorithm (Isolation Forest)? Even using SVM in PyCaret I get closer results than using Isolation Forest in H2O:

[Image: PyCaret SVM results]


Solution

  • TL;DR: your problem would be massively simplified by making the instances on which you detect anomalies whole cycles, rather than individual sensor samples. The differences between the methods you have applied are probably due to differences in hyperparameters, and the sensitivity to hyperparameters is due to the less-than-ideal problem specification.

    This is a time-series, and your anomalies seem to be stateful: an anomaly starts to occur, affects many time-steps, and then recovers. However, you appear to be trying to detect anomalies in individual time-steps/samples, which will not work well, because in the anomalous condition the highest values are still within the normal range of individual data points from the normal condition. Furthermore, there are strong temporal patterns in your data for the normal condition, and these cannot be modeled with such an approach. That different software packages give different, not-so-good results is expected, since tradeoffs have to be made and different hyperparameters influence them.

    What you should do is transform your original time-series into instances that are more meaningful than individual point samples. The best option for this kind of cyclic process, with strong similarities between cycles, is to transform the data into one time-series per cycle. This requires knowing (or reliably detecting) when a cycle starts.

    If the cycle start is not available, one can instead use a sliding window approach, where the window is long enough to cover one or more cycles; see the sketch below.
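    As a minimal sketch of that windowing step (assuming, as stated in the question, that 3,000 samples cover 5 cycles, i.e. roughly 600 samples per cycle; cycle_len and step are illustrative values, not something derived from your data):

import numpy as np

# Assumed from the question: 3,000 samples cover 5 cycles -> ~600 samples/cycle
cycle_len = 600

signal = df["Pressure_median_mw_2500_ac"].to_numpy()

# Tumbling windows, one per cycle (works if the cycle start is known/stable)
n_windows = len(signal) // cycle_len
windows = signal[:n_windows * cycle_len].reshape(n_windows, cycle_len)

# Sliding-window alternative when the cycle start is unknown
step = 100  # shift between consecutive windows; tune as needed
sliding = np.array([signal[i:i + cycle_len]
                    for i in range(0, len(signal) - cycle_len + 1, step)])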

    Once you have such a set of windows, one can think about doing anomaly detection on them. Start by computing basic statistics that summarize each window (mean, std, min, max, max-min, etc.). The anomalies you have shown as an example will be trivially separable by the mean value of the cycle (or its max or min). You don't even need an isolation forest; a Gaussian Mixture Model will do just fine and allows for more interpretable results. This should work across a wide range of models and hyperparameters, as in the sketch below.
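    For illustration, a minimal sketch using scikit-learn's GaussianMixture (it assumes the windows array from the previous sketch; the single mixture component and the 5% threshold are illustrative choices, not tuned values):

import numpy as np
from sklearn.mixture import GaussianMixture

# Summarize each window with a few basic statistics
feats = np.column_stack([
    windows.mean(axis=1),
    windows.std(axis=1),
    windows.min(axis=1),
    windows.max(axis=1),
    windows.max(axis=1) - windows.min(axis=1),  # peak-to-peak range
])

# Model the distribution of "normal" window statistics
gmm = GaussianMixture(n_components=1, random_state=123).fit(feats)

# Windows with low likelihood under the model are flagged as anomalous
scores = gmm.score_samples(feats)
anomalous = scores < np.quantile(scores, 0.05)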

    Once a basic solution that captures such large discrepancies is in place, one can consider going further. Adding a sequence-model autoencoder, for example, would be able to pick up much smaller deviations, if one has enough data.
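    Should you go down that route later, a minimal Keras sketch of such a sequence autoencoder could look like the following (the LSTM sizes, epoch count and error measure are all illustrative; it reuses the windows array from above, and it would need considerably more than 5 normal cycles to train well):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

cycle_len = 600  # as in the windowing sketch above

# LSTM autoencoder: compress a whole cycle into a code, then reconstruct it
model = keras.Sequential([
    layers.Input(shape=(cycle_len, 1)),
    layers.LSTM(32),                          # encoder -> fixed-size code
    layers.RepeatVector(cycle_len),           # repeat the code per time-step
    layers.LSTM(32, return_sequences=True),   # decoder
    layers.TimeDistributed(layers.Dense(1)),  # reconstruct the signal
])
model.compile(optimizer="adam", loss="mse")

# Train on known-normal cycles only
x_train = windows[:5][..., np.newaxis]
model.fit(x_train, x_train, epochs=50, verbose=0)

# High reconstruction error => the cycle deviates from the normal pattern
recon = model.predict(windows[..., np.newaxis])
errors = np.mean((recon - windows[..., np.newaxis]) ** 2, axis=(1, 2))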