
Inconsistent output from DBSCAN implementation in Weka


I am using the DBSCAN implementation in Weka, and it gives me different results depending on whether I select "Use training set" or "Classes to clusters evaluation" as the 'Cluster mode'. As per the documentation here, selecting "Classes to clusters evaluation" should only change the metrics reported.

With DBSCAN, however, I actually see a different number of clusters. Here's a way to reproduce the problem:

  1. Load the IRIS dataset: Select the "Preprocess" tab, click "Open file", go to the "data" folder inside your Weka installation and load the "iris" dataset.
  2. Go over to the "Cluster" tab and choose DBSCAN. Set epsilon=0.5 and minpts=5.
  3. Under 'Cluster mode', select the radio button "Use training set" and start the clustering. Look for the string "Number of generated clusters" in the output; this number is 3 for me.
  4. Now switch the cluster mode to "Classes to clusters evaluation" and re-run the clustering. I get 1 cluster now.

Is this expected behavior? Am I missing something?


Solution

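  • What I was missing is that with the "Use training set" setting, all attributes, including the class label, are used for clustering, whereas "Classes to clusters evaluation" ignores the class attribute while building the clusters. If I explicitly remove the class attribute, the results match.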
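
To illustrate why leaving the class label in can change the cluster count, here is a minimal toy sketch in plain Python. It is not Weka's implementation: the `distance`, `dbscan` and `count_clusters` helpers and the six-row dataset are made up for this example, and it assumes a nominal attribute adds 0 to the distance when two values match and 1 when they differ. With a small epsilon, that extra 1 is enough to keep points with different labels from ever being neighbours.

```python
import math


def distance(a, b):
    """Euclidean distance over mixed attributes (assumed scheme, see above):
    numeric values contribute their difference, nominal (string) values
    contribute 0 if equal and 1 if different."""
    total = 0.0
    for x, y in zip(a, b):
        d = (0.0 if x == y else 1.0) if isinstance(x, str) else x - y
        total += d * d
    return math.sqrt(total)


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # A point counts as its own neighbour.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


def count_clusters(labels):
    return len({label for label in labels if label != -1})


# Hypothetical data: one numeric attribute plus a nominal class label.
data = [(0.0, "a"), (0.1, "a"), (0.2, "a"),
        (0.3, "b"), (0.4, "b"), (0.5, "b")]

with_class = dbscan(data, eps=0.15, min_pts=2)
without_class = dbscan([row[:-1] for row in data], eps=0.15, min_pts=2)

print(count_clusters(with_class))     # 2: the label splits the chain
print(count_clusters(without_class))  # 1: all points are density-connected
```

The same effect would explain the Iris numbers in the question: with the class attribute included, each class ends up in its own cluster, and without it everything merges into one.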