Tags: python, machine-learning, scikit-learn, unsupervised-learning, dbscan

How to filter clusters produced by DBSCAN based on size?


I have applied DBSCAN to cluster a point cloud in which each point has X, Y and Z coordinates. I want to plot only the clusters that have fewer than 100 points. This is what I have so far:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=0.1, min_samples=20, metric='euclidean').fit(only_xy)
plt.scatter(only_xy[:, 0], only_xy[:, 1],
            c=clustering.labels_, cmap='rainbow')
# Note: components_ holds the core samples, not the clusters themselves
clusters = clustering.components_
# Store the labels
labels = clustering.labels_

# Then get the frequency count of the non-negative labels
counts = np.bincount(labels[labels >= 0])

print(counts)

Output: 
[1278  564  208   47   36   30  191   54   24   18   40  915   26   20
   24  527   56  677   63   57   61 1544  512   21   45  187   39  132
   48   55  160   46   28   18   55   48   35   92   29   88   53   55
   24   52  114   49   34   34   38   52   38   53   69]

So I have found the number of points in each cluster, but I'm not sure how to select only the clusters that have fewer than 100 points.


Solution

  • Build a lookup from each label to its cluster size, then collect the indices of the points whose cluster has fewer than 100 points (skipping noise, which DBSCAN labels -1):

    ls, cs = np.unique(labels, return_counts=True)
    dic = dict(zip(ls, cs))
    idx = [i for i, label in enumerate(labels) if dic[label] < 100 and label >= 0]

    Then use those indices to select the matching points and labels when plotting (more or less):

    plt.scatter(only_xy[idx, 0], only_xy[idx, 1],
            c=clustering.labels_[idx], cmap='rainbow')
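
A vectorized variant of the same idea, as a minimal sketch: it reuses the np.bincount counts from the question and builds a boolean mask instead of a Python list of indices. The helper name small_cluster_mask, its max_size parameter, and the toy labels array are made up for illustration; in the question's setting the input would be clustering.labels_.

```python
import numpy as np

def small_cluster_mask(labels, max_size=100):
    """Boolean mask of points whose cluster has fewer than max_size points.

    Noise points (label -1) are always excluded.
    """
    labels = np.asarray(labels)
    mask = np.zeros(labels.shape, dtype=bool)
    clustered = labels >= 0                      # drop noise before counting
    if clustered.any():
        counts = np.bincount(labels[clustered])  # counts[k] = size of cluster k
        # Look up each clustered point's cluster size and compare to the limit
        mask[clustered] = counts[labels[clustered]] < max_size
    return mask

# Toy labels standing in for clustering.labels_ (hypothetical data):
# cluster 0 has 4 points, cluster 1 has 2, cluster 2 has 5, and 2 are noise.
labels = np.array([0, 0, 0, 1, 1, -1, 2, 0, -1, 2, 2, 2, 2])
mask = small_cluster_mask(labels, max_size=3)
print(mask)
```

With the question's data this would presumably be used as mask = small_cluster_mask(clustering.labels_), followed by plt.scatter(only_xy[mask, 0], only_xy[mask, 1], c=clustering.labels_[mask], cmap='rainbow'). The IDs of the small clusters themselves can be read off with np.flatnonzero(counts < 100).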