I am trying to detect anomalies in a time series dataset. I am classifying the predicted values based on thresholds.
Here is a detailed description of what I did:
I split my dataset into training and testing sets, then fitted an ARIMA model on the training set. I used the fitted model to predict the test observations, then calculated the error between the actual and predicted values:
Error = actual_testing - predicted_testing
Next, I need to choose a threshold so that each observation can be classified based on its calculated error:
if Error > threshold ==> it is an anomaly
Is there any method to choose this threshold value?
One approach is to compute the errors across your training or validation set, then fit a statistical distribution to them, for example a Gaussian (normal distribution). This normalizes the range of the scores and lets you interpret a score as a probability. You can then set a threshold at, for example, 2 to 6 standard deviations from the mean, depending on how many anomalies you want to flag.
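As a rough illustration of the idea, here is a minimal sketch in Python using only the standard library. The function names (`fit_error_distribution`, `flag_anomalies`), the simulated training errors, and the choice of `k = 3` are assumptions made for the example, not part of the original setup. It also flags large errors in both directions by using the absolute z-score; if only positive errors matter, drop the `abs`.

```python
import random
import statistics

def fit_error_distribution(train_errors):
    """Estimate the mean and standard deviation of the residuals (Gaussian fit)."""
    mu = statistics.fmean(train_errors)
    sigma = statistics.stdev(train_errors)  # sample standard deviation
    return mu, sigma

def flag_anomalies(errors, mu, sigma, k=3.0):
    """Flag errors lying more than k standard deviations from the mean."""
    return [abs(e - mu) / sigma > k for e in errors]

# Hypothetical residuals standing in for (actual - predicted) on the training set.
rng = random.Random(0)
train_errors = [rng.gauss(0.0, 1.0) for _ in range(500)]

mu, sigma = fit_error_distribution(train_errors)

# Hypothetical test residuals; two of them are far outside the usual range.
test_errors = [0.2, -0.5, 8.0, 1.1, -7.5]
flags = flag_anomalies(test_errors, mu, sigma, k=3.0)
print(flags)
```

A smaller `k` flags more points and a larger `k` flags fewer, so in practice you would tune it against the number of anomalies you expect or can afford to review. If the residuals are clearly not Gaussian (heavy tails, skew), an empirical quantile of the training errors may be a more suitable threshold than a multiple of the standard deviation.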