I've been trying to use the DBSCAN clusterer from Weka to cluster instances. From what I understand, I should be using the clusterInstance() method for this, but to my surprise, when I looked at the code of that method, the implementation appears to ignore its parameter:
    /**
     * Classifies a given instance.
     *
     * @param instance The instance to be assigned to a cluster
     * @return int The number of the assigned cluster as an integer
     * @throws java.lang.Exception If instance could not be clustered
     * successfully
     */
    public int clusterInstance(Instance instance) throws Exception {
        if (processed_InstanceID >= database.size()) processed_InstanceID = 0;
        int cnum = (database.getDataObject(Integer.toString(processed_InstanceID++))).getClusterLabel();
        if (cnum == DataObject.NOISE)
            throw new Exception();
        else
            return cnum;
    }
This doesn't seem right. How is that supposed to work? Is there a different method I should be using for clustering? Do I have to run this method sequentially on all instances, in some specific order, if I want to get any useful information out of it?
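As Mark answered, this is obviously a bug. The method never reads its argument; it just returns the label of the next stored instance, tracked by an internal counter that wraps around at the end of the database. So as long as you query instances in the exact same order in which they were inserted into the clusterer, the results are correct; in any other order they are not.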
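To make the order dependence concrete, here is a minimal sketch that mimics the counter logic of the quoted method. It is not Weka code: CounterBasedLookup and its array of labels are hypothetical stand-ins for the clusterer and its database, and the noise check is left out.

```java
// Hypothetical stand-in for the clusterer; only the counter behaviour of the
// quoted method is reproduced here.
final class CounterBasedLookup {
    // Cluster labels in the order the instances were inserted.
    private final int[] labels;
    // Plays the role of processed_InstanceID.
    private int next = 0;

    CounterBasedLookup(int[] labelsInInsertionOrder) {
        this.labels = labelsInInsertionOrder.clone();
    }

    // Like the quoted method, this never looks at its argument.
    int clusterInstance(Object ignoredInstance) {
        if (next >= labels.length) next = 0;
        return labels[next++];
    }
}
```

With labels {0, 0, 1}, asking about the third instance first returns 0, the label of the first instance, rather than 1.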
A co-worker solved this by writing her own version of the DBScan class: essentially identical (copy-pasted), except that she maintains a mapping between instances and cluster labels. This mapping can be produced by iterating over the contents of the database instance. The appropriate cluster for an instance can then be immediately retrieved from that mapping.
Editing this method is also a good opportunity to change the throw new Exception() into something more sensible in this context, such as return -1.
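A rough sketch of the mapping idea follows. It is not my co-worker's actual code and uses no Weka classes: LabeledObject, ClusterLookup, the string key used to identify an instance, and the value of the NOISE placeholder are all assumptions made for illustration. In a real copy of the class you would fill the map from the database's data objects and choose a key that identifies an instance reliably.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one entry of the clusterer's database: a key that
// identifies the instance plus the cluster label assigned to it.
final class LabeledObject {
    // Placeholder value; the real DataObject.NOISE constant may differ.
    static final int NOISE = Integer.MIN_VALUE;

    final String instanceKey;
    final int clusterLabel;

    LabeledObject(String instanceKey, int clusterLabel) {
        this.instanceKey = instanceKey;
        this.clusterLabel = clusterLabel;
    }
}

// Hypothetical helper holding the instance-to-label mapping.
final class ClusterLookup {
    private final Map<String, Integer> labelByKey = new HashMap<>();

    // Build the mapping once, after clustering, by iterating over the database.
    ClusterLookup(List<LabeledObject> database) {
        for (LabeledObject object : database) {
            labelByKey.put(object.instanceKey, object.clusterLabel);
        }
    }

    // Returns the cluster label for the given key, or -1 if the instance is
    // noise or was never seen, instead of throwing an exception.
    int clusterOf(String instanceKey) {
        Integer label = labelByKey.get(instanceKey);
        if (label == null || label == LabeledObject.NOISE) {
            return -1;
        }
        return label;
    }
}
```

Lookups then no longer depend on query order: clusterOf can be called for any instance, any number of times, and returns the same answer each time.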