This example is from Data Science for dummies:
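import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.cluster import DBSCAN

digits = load_digits()
X = digits.data
ground_truth = digits.target
pca = PCA(n_components=40)
Cx = pca.fit_transform(scale(X))
# random_state is omitted: DBSCAN in current scikit-learn releases does not accept it
DB = DBSCAN(eps=4.35, min_samples=25)
DB.fit(Cx)
# when noise points exist, the first unique label is -1, so k is 1 for cluster 0
for k, cl in enumerate(np.unique(DB.labels_)):
    if cl >= 0:
        example = np.min(np.where(DB.labels_ == cl))  # question 1
        plt.subplot(2, 3, k)
        plt.imshow(digits.images[example], cmap='binary',  # question 2
                   interpolation='none')
        plt.title('cl ' + str(cl))
plt.show()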
My questions are:
The operation DB.labels_ == cl produces a Boolean array such that (DB.labels_ == cl)[i] is True if DB.labels_[i] == cl. np.where is then applied to this Boolean array. When called with a single array, np.where returns the indices of its nonzero elements, i.e. of the elements that are True.
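As a minimal sketch of these two steps, the snippet below uses a small made-up label array in place of DB.labels_ (the values are hypothetical, chosen only for illustration). It shows the Boolean mask and the tuple of index arrays that np.where returns for a single argument.

```python
import numpy as np

# Hypothetical labels standing in for DB.labels_ (-1 marks noise in DBSCAN).
labels = np.array([-1, 0, 1, 0, -1, 1, 0])
cl = 0

mask = labels == cl        # Boolean array, one entry per sample
indices = np.where(mask)   # tuple with one index array per dimension

print(mask)
print(indices[0])          # positions where the mask is True
```

Because the label array is one-dimensional, the tuple holds a single array, and indices[0] lists the positions of the samples in cluster cl.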
The operation np.where(DB.labels_ == cl) therefore returns the indices of the elements of DB.labels_ that are equal to cl. These are the samples passed to fit that DB has labeled as part of cluster cl.
Here, np.min returns the smallest index in that array, i.e. the first sample in your dataset that was assigned to cluster cl. By looping through all the clusters, you retrieve one example image from each cluster.
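The loop can be sketched on the same kind of toy data. The label array and the first_example dictionary are hypothetical names introduced here for illustration; only np.unique, np.where and np.min come from the original code.

```python
import numpy as np

# Hypothetical labels standing in for DB.labels_ (-1 marks noise in DBSCAN).
labels = np.array([-1, 0, 1, 0, -1, 1, 0])

first_example = {}
for cl in np.unique(labels):
    if cl >= 0:  # skip the noise label
        # smallest index among the samples assigned to cluster cl
        first_example[int(cl)] = int(np.min(np.where(labels == cl)))

print(first_example)
```

An equivalent way to get the same index is np.flatnonzero(labels == cl)[0], which some readers may find easier to follow than taking the minimum of the np.where output.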
These indices correspond to those of digits.images, because DB.labels_ contains the label of each point in the dataset that you fed to DB.fit, and that dataset has the same order as digits.images.
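One way to check this alignment, assuming scikit-learn is installed and using the parameters from the question (without random_state), is sketched below. It verifies that each row of digits.data is the corresponding image flattened, and that DBSCAN returns exactly one label per image.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.cluster import DBSCAN

digits = load_digits()

# Each row of digits.data is the matching 8x8 image flattened to 64 values.
flattened = digits.images.reshape(len(digits.images), -1)
same_rows = np.array_equal(flattened, digits.data)

# scale and PCA transform each row without reordering the rows, so DBSCAN
# produces one label per original sample, in the original order.
Cx = PCA(n_components=40).fit_transform(scale(digits.data))
labels = DBSCAN(eps=4.35, min_samples=25).fit(Cx).labels_

print(same_rows, labels.shape[0] == len(digits.images))
```

Since row i of the input, labels[i] and digits.images[i] all refer to the same sample, an index found in the labels can be used directly on digits.images.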