python · machine-learning · text-classification · autoencoder · anomaly-detection

Suppressing false positives (incorrectly classified as outlier/anomaly) in Anomaly Detection using Autoencoders


How does one suppress certain outliers in Anomaly detection?

We built a model using autoencoders and it has detected anomalies. Some of the data points flagged as anomalies (i.e., falling outside the normal distribution) are not actually anomalies.

How do we train the model not to flag these as anomalies?

Do we add multiple duplicates of these data points to the dataset and then retrain, or are there other techniques we can apply here?

Here the normal distribution is over cosine similarity (distance), since the data points are vectorized representations of text data (log entries). If the cosine distance between the input vector and the reconstructed vector does not fall within the normal distribution, the point is treated as an anomaly.
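A minimal sketch of this kind of detector (the function names and the mean + 3·std threshold rule are illustrative assumptions, not from the post): a point is flagged when the cosine distance between its input vector and the autoencoder's reconstruction exceeds a threshold estimated from the distances seen on normal training data.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes non-zero vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def fit_threshold(train_inputs, train_recons, k=3.0):
    # Estimate a cutoff from the empirical distribution of
    # reconstruction distances on normal data: mean + k * std.
    d = np.array([cosine_distance(x, r)
                  for x, r in zip(train_inputs, train_recons)])
    return d.mean() + k * d.std()

def flag_anomalies(inputs, reconstructions, threshold):
    # A point is anomalous when its reconstruction distance
    # exceeds the threshold.
    d = np.array([cosine_distance(x, r)
                  for x, r in zip(inputs, reconstructions)])
    return d > threshold
```

The `reconstructions` array stands in for whatever the trained autoencoder outputs for each input vector.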


Solution

  • Since the anomaly detector is usually trained unsupervised, it can be hard to incorporate labels directly into that process without losing its outlier-detection properties. A simple alternative is to take the instances that were marked as anomalies and put them into a classifier that separates "real anomaly" from "not real anomaly". This classifier would be trained on prior anomalies that have been labeled. It can be either binary classification, or one-class with respect to the known "not real" samples. A simple starting point would be k-Nearest-Neighbours or a domain-specific distance function. The classifier can use the latent feature vector as input, or do its own feature extraction.

    This kind of system is described in Anomaly Detection with False Positive Suppression (relayr.io). The same basic idea is used to minimize the false-negative rate in this paper: SNIPER: Few-shot Learning for Anomaly Detection to Minimize False-negative Rate with Ensured True-positive Rate.
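The kNN-based suppressor suggested above could be sketched as follows (a hypothetical setup that assumes you have latent feature vectors for previously flagged anomalies, labeled 1 = real anomaly and 0 = false positive; scikit-learn's `KNeighborsClassifier` with a cosine metric stands in for the domain-specific distance function):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def build_suppressor(latent_vectors, labels, k=3):
    # Train a kNN classifier on latent vectors of previously
    # flagged anomalies; labels: 1 = real anomaly, 0 = false positive.
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(latent_vectors, labels)
    return clf

def filter_anomalies(clf, flagged_latents):
    # Second-stage filter: keep only the flagged points that the
    # classifier considers real anomalies.
    preds = clf.predict(flagged_latents)
    return flagged_latents[preds == 1], preds
```

At inference time, everything the autoencoder flags is passed through `filter_anomalies`, so only points resembling previously confirmed anomalies are reported.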