python machine-learning scikit-learn cluster-analysis dbscan

NameError: name 'labels_true' is not defined for dbscan

I am using a template script and trying to feed in my data. However, I am not sure what labels_true refers to, as the error states that it is undefined.

Here is my data array:

data=array([[5.71585827e+00, 3.32320000e+04],
       [0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00],
       ...,
       [9.57746479e-02, 3.40000000e+01],
       [7.01388889e-01, 1.01000000e+02],
       [9.70350404e-02, 3.60000000e+01]])

Now I am applying this script:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler


# #############################################################################

# Standardize the features to zero mean and unit variance
X = data
X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))


NameError: name 'labels_true' is not defined

Solution

From the scikit-learn documentation for homogeneity_score:

Homogeneity metric of a cluster labeling given a ground truth.

where labels_true are

ground truth class labels to be used as a reference

So, if you already have the ground truth, that would be the labels_true argument, which would be compared with your predicted labels to give the score.
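As a minimal sketch of what the metric expects, assuming small hypothetical label lists rather than the data from the question, both arguments are per-sample label arrays of the same length:

```python
from sklearn import metrics

# Hypothetical toy labels for illustration; not taken from the question's data.
labels_true = [0, 0, 1, 1]   # known classes (the ground truth)
labels_pred = [1, 1, 0, 0]   # cluster ids, e.g. what db.labels_ would hold

# Every cluster contains samples of a single class, so the labeling is
# perfectly homogeneous even though the cluster ids differ from the class ids.
score = metrics.homogeneity_score(labels_true, labels_pred)
print("Homogeneity: %0.3f" % score)

# Putting all samples in one cluster mixes both classes together.
mixed = metrics.homogeneity_score(labels_true, [0, 0, 0, 0])
print("Homogeneity (single cluster): %0.3f" % mixed)
```

The score depends only on how samples are grouped, not on the particular integers used as cluster ids.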

Here the error occurs because the script never defines labels_true: the template assumes ground truth labels exist, but you have not provided any, so Python raises the NameError when it reaches that name.

It follows directly that, if the ground truth is not available, this metric cannot be used.
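If no ground truth exists, one possible option is a metric that needs only the data and the predicted labels, such as scikit-learn's silhouette_score. The sketch below is an illustration under assumptions, not a recommendation specific to your data: it uses synthetic points from make_blobs as a stand-in for the data array, and the eps and min_samples values are simply copied from the question.

```python
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's `data` array (an assumption).
X, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_

# DBSCAN marks noise points with the label -1, so exclude it from the count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters: %d" % n_clusters)

# silhouette_score needs at least two distinct labels; no labels_true required.
# Note that noise points (label -1) are treated as a group of their own here.
sil = None
if len(set(labels)) > 1:
    sil = metrics.silhouette_score(X, labels)
    print("Silhouette Coefficient: %0.3f" % sil)
```

The silhouette coefficient ranges from -1 to 1, with higher values indicating better-separated clusters; it measures something different from homogeneity, so the two scores are not interchangeable.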