I am working on a dataset and want to build an anomaly detection model for it. The data contains only three features: Latitude, Longitude and Speed. I normalized it, applied t-SNE, and then normalized the result again. There are no labels or target values, so this has to be unsupervised anomaly detection.
I cannot share the data since it is private, but it looks like this:
There are some abnormal values in the data, for example:
Here's the final shape of the data:
As you can see, the data is a bit complicated. When I searched for abnormal instances manually (by looking at the feature values), I found that the instances inside the red circle in the image below should be detected as anomalies:
I used OneClassSVM to detect anomalies. Here are the parameters:
nu = 0.02
kernel = "rbf"
gamma = 0.1
degree = 3
verbose = False
random_state = rng
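And the model:
from sklearn import svm

# fit the model
# random_state is left out: OneClassSVM ignored it, and recent scikit-learn releases no longer accept it
clf = svm.OneClassSVM(nu=nu, kernel=kernel, gamma=gamma, verbose=verbose)
clf.fit(data_scaled)
y_pred_train = clf.predict(data_scaled)
# number of training points flagged as anomalies (label -1)
n_error_train = y_pred_train[y_pred_train == -1].size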
Here is what I obtained at the end. The red instances are the ones OneClassSVM detected as anomalies:
So, as you can see, the model predicted many instances as anomalies, but in reality, most of these instances should be normal.
I tried different values for nu, gamma and degree. However, I could not find a suitable decision boundary that detects only the real anomalies.
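For reference, a parameter sweep like the one described could be sketched as follows. The data is a synthetic stand-in (the real data is private), and the grid values are arbitrary assumptions. Note that degree only affects the "poly" kernel, so it has no effect with kernel="rbf". In theory nu is an upper bound on the fraction of training errors, so the flagged fraction is expected to stay near nu, although in practice it can drift away from it.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Hypothetical stand-in for data_scaled: one dense blob plus a few far-away points
X = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 2)),
    rng.uniform(6.0, 8.0, size=(10, 2)),
])

# Fraction of training points flagged as anomalies for each (nu, gamma) pair
results = {}
for nu in (0.01, 0.02, 0.05):
    for gamma in (0.01, 0.1, 1.0):
        pred = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit_predict(X)
        results[(nu, gamma)] = float(np.mean(pred == -1))

for (nu, gamma), frac in sorted(results.items()):
    print(f"nu={nu:<5} gamma={gamma:<5} flagged fraction={frac:.3f}")
```

If the flagged fraction mostly follows nu, then changing gamma reshapes the boundary but does not reduce how many points get flagged.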
It appears some of the anomalies reported by One-class SVM are global but not local anomalies. You might want to try Local Outlier Factor.
It considers the local structure of your data, so the points on the left side that belong to small clusters should score as less anomalous.
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html
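from sklearn.neighbors import LocalOutlierFactor

# fit the model and label each training point (1 = inlier, -1 = outlier)
clf = LocalOutlierFactor()
y_pred_train = clf.fit_predict(data_scaled)
n_error_train = y_pred_train[y_pred_train == -1].size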
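To illustrate the local behaviour, here is a minimal sketch on synthetic data, since the real data is not available. The cluster layout, n_neighbors=20 and contamination=0.01 are assumptions chosen for the example, not tuned values. It reads the fitted negative_outlier_factor_ attribute, where values near -1 are typical of inliers and strongly negative values indicate outliers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Hypothetical stand-in for data_scaled
dense = rng.normal(0.0, 0.5, size=(300, 2))   # large dense cluster
small = rng.normal(5.0, 0.2, size=(25, 2))    # small but tight cluster
isolated = np.array([[10.0, 10.0], [-8.0, 9.0], [9.0, -7.0]])  # truly isolated points
X = np.vstack([dense, small, isolated])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)              # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_    # the lower, the more anomalous

# Indices of the most anomalous points, most anomalous first
worst = np.argsort(scores)[:5]
print("most anomalous indices:", worst)
print("their scores:", np.round(scores[worst], 2))
print("mean score of the small cluster:", round(float(scores[300:325].mean()), 2))
```

Ranking by score rather than relying only on the -1/1 labels makes it easier to inspect where a sensible cutoff lies before settling on a contamination value.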
I would also try Isolation Forest and tweak the contamination ratio. You don't have to scale your data for Isolation Forest, and I suspect you might not want to here.
from sklearn.ensemble import IsolationForest

# fit the model on the unscaled data
clf = IsolationForest(contamination=0.01)
clf.fit(data)
y_pred_train = clf.predict(data)
n_error_train = y_pred_train[y_pred_train == -1].size
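One way to experiment with the contamination ratio without refitting each time is to fit once and threshold the anomaly scores at different percentiles. The sketch below does this on synthetic data. The data, the candidate ratios and random_state=0 are all assumptions made for illustration. As far as I know, this percentile cut mirrors how scikit-learn applies contamination internally.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Hypothetical stand-in for the unscaled data (three features)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(1000, 3)),   # bulk of the data
    rng.uniform(8.0, 10.0, size=(10, 3)),   # a handful of far-away points
])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = forest.score_samples(X)  # the lower, the more anomalous

# Number of points that each candidate contamination ratio would flag
flagged_counts = {}
for contamination in (0.005, 0.01, 0.02):
    threshold = np.percentile(scores, 100 * contamination)
    flagged_counts[contamination] = int((scores < threshold).sum())

for contamination, count in flagged_counts.items():
    print(f"contamination={contamination}: {count} points flagged")
```

Plotting the flagged points for each ratio against the region you marked by hand should show which ratio comes closest to the anomalies you expect.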