
Scikit-Learn DBSCAN clustering yielding no clusters


I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.

However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is assigned the label -1. According to the documentation:

Noisy samples are given the label -1.

I'm not really sure what this means, but I was getting some OK clusters with KMeans so I know there is something there to cluster -- it's not just random.
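To illustrate what the -1 label means, here is a tiny hedged example (hypothetical toy points, not my data): when eps is too small relative to the spacing between points, no point has enough neighbors to become a core point, so DBSCAN marks everything as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three points spaced far apart relative to eps=0.5:
X = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])

# min_samples=2 means a core point needs 2 points (itself included)
# within eps; here each point only has itself, so all are noise.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # every point labeled -1 (noise)
```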

Here is the code I am using for clustering:

import numpy as np
import sklearn.cluster

# Covariance matrix of the features, required by the Mahalanobis metric
covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis",
                                   metric_params={"V": covariance})
clusterer.fit(data)

And that's all. I know for certain that data is a numeric Pandas DataFrame as I have inspected it in the debugger.

What could be causing this issue?


Solution

  • You need to choose the parameter eps, too.

    DBSCAN results depend heavily on this parameter. You can find methods for estimating it in the literature, e.g., the sorted k-distance plot suggested in the original DBSCAN paper. In your case, the default eps=0.5 is almost certainly far smaller than typical Mahalanobis distances between points in a dozen dimensions, so no point has any neighbors within eps and everything is labeled noise.

    IMHO, sklearn should not provide a default for this parameter, because the default rarely works (on normalized toy data it is usually okay, but that's about it).

    200 instances are probably too few to reliably estimate density, particularly with a dozen variables.
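A minimal sketch of the k-distance heuristic for choosing eps, using synthetic stand-in data (two well-separated Gaussian blobs, not the asker's DataFrame; the quantile-based knee proxy is a crude assumption — normally you read the knee off a plot):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-in for the question's shape: ~200 rows, 12 columns, two blobs.
data = np.vstack([rng.normal(0, 1, (100, 12)),
                  rng.normal(5, 1, (100, 12))])

# k-distance heuristic: compute each point's distance to its k-th
# nearest neighbor; the "knee" of the sorted curve is a common eps.
k = 6  # matches min_samples
nn = NearestNeighbors(n_neighbors=k).fit(data)
dists, _ = nn.kneighbors(data)   # includes the point itself at column 0
k_dist = np.sort(dists[:, -1])
eps = np.quantile(k_dist, 0.9)   # crude knee proxy for a runnable example

labels = DBSCAN(eps=eps, min_samples=k).fit_predict(data)
print(f"eps={eps:.2f}, clusters={len(set(labels) - {-1})}, "
      f"noise={(labels == -1).sum()}")
```

With an eps chosen this way, most points have enough neighbors to be core points, so the blobs come out as clusters instead of everything collapsing to -1.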