Tags: scikit-learn, data-science, cluster-analysis, data-analysis, dbscan

What does it mean when a cluster label is -1?


from sklearn.cluster import DBSCAN
model = DBSCAN(eps=3.3, leaf_size=5, min_samples=3)
y_pred = model.fit_predict(df)

My silhouette score is:

from sklearn.metrics import silhouette_score
silhouette_score(df, y_pred)

output:

0.4432857434946073

However, my labels are as follows:

code:

set(model.labels_)

output:

{-1, 0}

What do the cluster labels -1 and 0 mean, and how do I fix this?

Note: I don't know if this is important, but here is a sample of my data:

df.head()

output:

   Gender        Age  education  satisfaction     salary  performance
---------------------------------------------------------------------
0     0.0   0.446350  -1.010909     -0.891688   1.153254    -0.108350
1     1.0   1.322365  -0.147150     -1.868426  -0.660853    -0.291719
2     1.0   0.008343  -0.887515     -0.891688   0.246200    -0.937654
3     0.0  -0.429664  -0.764121      1.061787   0.246200    -0.763634
4     1.0  -1.086676  -0.887515     -1.868426  -0.660853    -0.644858

As you can see, my data is multidimensional, and I can't reduce its dimensionality.


Solution

  • As explained in the docs, -1 stands for noise: points that DBSCAN does not assign to any cluster. A noise point is not a core point (it has fewer than min_samples points, itself included, within distance eps) and is not within eps of any core point either.

    Here you have a single cluster (0) and some noise (points with label -1).

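    If you expected more clusters, you should tune eps and min_samples. With one large cluster, decreasing eps usually helps, since a large eps links distant points into a single cluster.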
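
To illustrate the points above, here is a minimal sketch on made-up toy data (not the asker's df). The array X, the helper summarize, and the two eps values are assumptions chosen for illustration only. It shows how a large eps can produce the labels {-1, 0}, and how a smaller eps splits the data into more clusters. It also scores only the clustered points, because silhouette_score has no special handling for -1 and would treat noise as if it were one more cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two tight groups and one isolated point.
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [20.0, 20.0],
])


def summarize(labels):
    """Return (number of clusters, number of noise points); -1 marks noise."""
    labels = np.asarray(labels)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Compare a large eps (groups merge) with a small eps (groups separate).
results = {}
for eps in (10.0, 0.5):
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    results[eps] = labels
    n_clusters, n_noise = summarize(labels)
    print(f"eps={eps}: {n_clusters} cluster(s), {n_noise} noise point(s)")

# Score only the points that were assigned to a cluster.
tight = results[0.5]
mask = tight != -1
score_without_noise = silhouette_score(X[mask], tight[mask])
print(f"silhouette without noise: {score_without_noise:.3f}")
```

On this toy data the larger eps merges both groups into a single cluster and leaves the isolated point as noise, while the smaller eps keeps the two groups apart. On real data, the same loop over a few candidate eps and min_samples values is a quick way to see how the cluster count and the amount of noise respond before settling on parameters.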