python anomaly-detection

Python Anomaly Detection (PyOD) is not converging


I am experimenting with PyOD, using CBLOF for anomaly detection, but I have been unable to flag any anomalies. Running the CBLOF algorithm throws the following error:

ValueError: Buffer dtype mismatch, expected 'INT' but got 'long long'

Exception ignored in: 'sklearn.cluster._k_means._assign_labels_csr' ValueError: Buffer dtype mismatch, expected 'INT' but got 'long long'

Which results in:

ValueError: Could not form valid cluster separation. Please change n_clusters or change clustering method
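
For context, this second error relates to the step in CBLOF that splits clusters into "large" and "small" groups by size. The following is a simplified, pure-Python sketch of that kind of rule, not PyOD's internal code: the function name `find_cluster_boundary` is hypothetical, the order in which the two criteria are tried is an assumption, and the `alpha=0.9` and `beta=5` defaults are meant to mirror CBLOF's documented defaults.

```python
def find_cluster_boundary(cluster_sizes, alpha=0.9, beta=5.0):
    """Return how many of the largest clusters count as 'large', or None.

    Hypothetical helper for illustration only (not part of PyOD).
    """
    # Ignore empty clusters and sort the rest from largest to smallest.
    sizes = sorted((s for s in cluster_sizes if s > 0), reverse=True)
    n_samples = sum(sizes)

    alpha_hits, beta_hits = [], []
    for i in range(1, len(sizes)):
        # alpha criterion: the i largest clusters cover most of the data.
        if sum(sizes[:i]) >= alpha * n_samples:
            alpha_hits.append(i)
        # beta criterion: a sharp size drop between cluster i-1 and cluster i.
        if sizes[i - 1] / sizes[i] >= beta:
            beta_hits.append(i)

    # Assumed preference order: both criteria, then alpha only, then beta only.
    both = [i for i in alpha_hits if i in beta_hits]
    for candidates in (both, alpha_hits, beta_hits):
        if candidates:
            return candidates[0]
    return None  # no valid separation between large and small clusters


print(find_cluster_boundary([100, 95, 5]))   # two large clusters, one small
print(find_cluster_boundary([70, 65, 65]))   # similar sizes, no boundary
```

Under these assumptions, clusters of nearly equal size (or a single non-empty cluster) give no boundary, which is the kind of situation in which changing n_clusters or the clustering method would be suggested.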

It appears that CBLOF depends on sklearn.cluster, and that the data type PyOD passes to sklearn is not the one sklearn expects.

Below are four scenarios I prepared using different CBLOF parameters. The same error is thrown regardless of how these parameters are changed.

I have also tried changing the number of clusters, using the elbow method to find the optimal k in the KMeans scenarios.

Sample code:

from pyod.models.cblof import CBLOF
import pyod.utils as ut
from sklearn import cluster

#create some data
data = ut.data.generate_data()[0]

#scenario 1 - use default CBLOF parameters
model = CBLOF()
clusters = model.fit_predict(data)

#scenario 2 - use kmeans as a centroid estimator
n_clusters = 3
kmeans = cluster.KMeans(n_clusters)
model = CBLOF(n_clusters = n_clusters, clustering_estimator = kmeans)
clusters = model.fit_predict(data)

#test if scaling the data makes a difference
data_scaled = (data - data.min())/(data.max()-data.min())

#scenario 3 - no clusters specified, use defaults, scaled data
model = CBLOF()
clusters = model.fit_predict(data_scaled)

#scenario 4 - use kmeans as a centroid estimator, scaled data
n_clusters = 3
kmeans = cluster.KMeans(n_clusters)
model = CBLOF(n_clusters = n_clusters, clustering_estimator = kmeans)
clusters = model.fit_predict(data_scaled)

All the packages I am using are up to date, and I have also tried using different data types in my input array.

Why are these errors being thrown?


Solution

  • The issue was caused by outdated versions of sklearn and PyOD. Updating both packages resolved the error.
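
Before upgrading, it can help to confirm which versions are actually installed in the environment that runs the script. A minimal sketch using only the standard library follows; the helper name `installed_versions` is hypothetical, and the package names are the PyPI distribution names (scikit-learn rather than sklearn).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this interpreter's environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["pyod", "scikit-learn", "numpy"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

If the reported versions are old, running `pip install --upgrade pyod scikit-learn` in the same environment and restarting the interpreter is the usual way to bring both packages up to date together.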