scikit-learn, svm, anomaly-detection

Good Anomaly Detection Model for Complicated Data


I am working with a dataset and want to build an anomaly detection model for it. The data contains only three features: Latitude, Longitude and Speed. I normalized it, applied t-SNE, and then normalized the result again. There are no labels or target values, so this has to be unsupervised anomaly detection.

I cannot share the data because it is private, but it looks like this:
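The preprocessing described above (normalize, t-SNE, normalize again) could look roughly like the sketch below. The original data is private, so the array `raw` is synthetic, the value ranges are made up, and `MinMaxScaler` is only an assumption about what "normalized" means here.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the private data (assumed ranges, not the real values).
rng = np.random.RandomState(42)
raw = np.column_stack([
    rng.uniform(40.0, 41.0, size=300),   # latitude
    rng.uniform(28.0, 29.0, size=300),   # longitude
    rng.uniform(0.0, 120.0, size=300),   # speed
])

# Step 1: normalize the three raw features to [0, 1].
features_scaled = MinMaxScaler().fit_transform(raw)

# Step 2: embed into 2-D with t-SNE (perplexity must be below the sample count).
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embedding = tsne.fit_transform(features_scaled)

# Step 3: normalize the embedding again; this is the array the detector is fit on.
data_scaled = MinMaxScaler().fit_transform(embedding)

print(data_scaled.shape)
```

Note that scikit-learn's `TSNE` has no `transform` method, so an embedding built this way cannot be applied to points that were not part of the fit.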

img

The data contains some abnormal values, for example:

img

Here's the final shape of the data:

img

As you can see, the data is a bit complicated. When I searched for abnormal instances manually (by looking at the feature values), I found that the instances inside the red circle in the image below should be detected as anomalies.

The instances inside the red region should be abnormal:

img

I used OneClassSVM to detect anomalies. Here are the parameters:

nu = 0.02
kernel = "rbf"
gamma = 0.1
degree = 3
verbose = False
random_state = rng

And the model:

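from sklearn import svm

# fit the model
# random_state is not passed: OneClassSVM stopped accepting it in scikit-learn 0.22
clf = svm.OneClassSVM(nu=nu, kernel=kernel, gamma=gamma, verbose=verbose)
clf.fit(data_scaled)
# predict returns +1 for inliers and -1 for outliers
y_pred_train = clf.predict(data_scaled)
# number of training instances flagged as anomalies
n_error_train = y_pred_train[y_pred_train == -1].size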

Here is what I obtained at the end:

img

Here are the anomalies detected by OneClassSVM; the red instances were flagged as anomalies:

img

So, as you can see, the model predicted many instances as anomalies, but in reality, most of these instances should be normal.

I tried different parameter values for nu, gamma and degree. However, I could not find a decision boundary that detects only the real anomalies.
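A parameter sweep like the one described above can be sketched as follows. The data is a synthetic stand-in for `data_scaled`, and the helper `flagged_fraction` is a hypothetical name introduced only for this illustration. Two documented properties of scikit-learn's `OneClassSVM` matter here: `nu` is an upper bound on the fraction of training errors (and a lower bound on the fraction of support vectors), and `degree` is only used by the polynomial kernel, so it has no effect when `kernel="rbf"`.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for data_scaled: one dense blob plus a few scattered points.
rng = np.random.RandomState(0)
data_scaled = np.vstack([
    rng.normal(0.5, 0.05, size=(500, 2)),
    rng.uniform(0.0, 1.0, size=(10, 2)),
])

def flagged_fraction(**params):
    """Fit an RBF OneClassSVM and return the fraction of training points labelled -1."""
    model = OneClassSVM(kernel="rbf", **params).fit(data_scaled)
    return float(np.mean(model.predict(data_scaled) == -1))

# Sweep nu and gamma and record how much of the training set gets flagged.
results = {}
for nu in (0.01, 0.02, 0.05):
    for gamma in (0.1, 1.0, 10.0):
        results[(nu, gamma)] = flagged_fraction(nu=nu, gamma=gamma)
        print(f"nu={nu:<5} gamma={gamma:<5} flagged={results[(nu, gamma)]:.3f}")

# With the RBF kernel, changing degree leaves the predictions unchanged.
pred_deg2 = OneClassSVM(kernel="rbf", nu=0.02, gamma=0.1, degree=2).fit(data_scaled).predict(data_scaled)
pred_deg5 = OneClassSVM(kernel="rbf", nu=0.02, gamma=0.1, degree=5).fit(data_scaled).predict(data_scaled)
degree_has_no_effect = bool(np.array_equal(pred_deg2, pred_deg5))
print("degree ignored by rbf kernel:", degree_has_no_effect)
```

In a sweep of this kind, only `nu` and `gamma` actually change the RBF decision boundary.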

  • What is wrong with my model? Should I try a different anomaly detection algorithm?
  • Is my data not suitable for anomaly detection?

Solution

  • It appears that some of the anomalies reported by One-class SVM are global anomalies but not local ones. You might want to try Local Outlier Factor.

    It considers the local structure of your data, so the points on the left side that were flagged but belong to small clusters should score as less anomalous.

    http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html

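    from sklearn.neighbors import LocalOutlierFactor

    # fit the model; fit_predict labels the training data (+1 inlier, -1 outlier)
    clf = LocalOutlierFactor()
    y_pred_train = clf.fit_predict(data_scaled)
    # number of training instances flagged as anomalies
    n_error_train = y_pred_train[y_pred_train == -1].size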
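To illustrate the local behaviour, here is a minimal sketch on synthetic data (the cluster positions, sizes and the `contamination` value are assumptions, not taken from the question). It builds a large cluster, a small dense cluster and three isolated points, then inspects both the labels and the raw scores in `negative_outlier_factor_`, where lower values mean more anomalous.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for data_scaled (assumed layout).
rng = np.random.RandomState(1)
main_cluster = rng.normal(loc=(0.7, 0.5), scale=0.05, size=(400, 2))
small_cluster = rng.normal(loc=(0.1, 0.5), scale=0.02, size=(30, 2))   # small but dense group
stragglers = np.array([[0.4, 0.9], [0.4, 0.1], [0.95, 0.95]])          # isolated points
data_scaled = np.vstack([main_cluster, small_cluster, stragglers])

# n_neighbors=20 is the scikit-learn default; it is kept below the size of the
# small cluster so that those points are compared mostly with each other.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
y_pred_train = lof.fit_predict(data_scaled)

# Raw scores: the lower the value, the more anomalous the point.
scores = lof.negative_outlier_factor_

n_error_train = int(np.sum(y_pred_train == -1))
print("flagged:", n_error_train)
print("labels of the three isolated points:", y_pred_train[-3:])
print("five most anomalous row indices:", np.argsort(scores)[:5])
```

If too many or too few points are flagged on the real data, `n_neighbors` and `contamination` are the two knobs worth varying first.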
    

    I would also try Isolation Forest and tweak the contamination ratio. You don't have to scale your data for Isolation Forest, and I suspect you might not want to here.

    http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.predict

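    from sklearn.ensemble import IsolationForest

    # fit the model on the unscaled data; contamination is the expected share of outliers
    clf = IsolationForest(contamination=0.01)
    clf.fit(data)
    # predict returns +1 for inliers and -1 for outliers
    y_pred_train = clf.predict(data)
    # number of training instances flagged as anomalies
    n_error_train = y_pred_train[y_pred_train == -1].size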
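Tweaking the contamination ratio could be done along the lines of the sketch below. The data is synthetic (unscaled latitude, longitude and speed with a handful of implausibly high speeds), and the candidate values and the fixed `random_state` are assumptions for illustration. Ranking by `score_samples` is an alternative to picking a contamination value up front: lower scores are more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic, unscaled stand-in for the raw data (assumed values).
rng = np.random.RandomState(2)
data = np.vstack([
    rng.normal(loc=(41.0, 29.0, 60.0), scale=(0.05, 0.05, 10.0), size=(1000, 3)),
    rng.normal(loc=(41.0, 29.0, 250.0), scale=(0.05, 0.05, 5.0), size=(10, 3)),  # implausible speeds
])

# Count how many rows are flagged for several contamination values.
counts = {}
for contamination in (0.005, 0.01, 0.02, 0.05):
    clf = IsolationForest(contamination=contamination, random_state=0)
    clf.fit(data)
    y_pred_train = clf.predict(data)
    counts[contamination] = int(np.sum(y_pred_train == -1))
    print(f"contamination={contamination:<6} flagged={counts[contamination]}")

# Rank rows by anomaly score instead of relying on a fixed threshold.
clf = IsolationForest(random_state=0).fit(data)
scores = clf.score_samples(data)          # lower = more anomalous
ranking = np.argsort(scores)
print("ten most anomalous row indices:", ranking[:10])
```

Inspecting the top-ranked rows by hand, as was done for the red region in the question, is one way to decide which contamination value matches the anomalies you actually care about.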