
Getting Xmeans clusterer output programmatically in Weka


When using SimpleKMeans in Weka, one can call getAssignments() on the trained model to get the cluster assignment for each instance. Here's a (truncated) Jython example:

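>>> from weka.clusterers import SimpleKMeans
>>> kmeans = SimpleKMeans()                 # instantiate first; other options (e.g. number of clusters) omitted
>>> kmeans.setPreserveInstancesOrder(True)  # required, otherwise getAssignments() throws an exception
>>> kmeans.buildClusterer(data)             # data is a weka.core.Instances object
>>> assignments = kmeans.getAssignments()
>>> assignments
array('i',[14, 16, 0, 0, 0, 0, 16,...])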

The position of each entry in the array corresponds to the index of an instance: instance 0 is in cluster 14, instance 1 is in cluster 16, and so on.
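To make that mapping concrete, here is a minimal sketch in plain Python that inverts such an array into a cluster-to-instances lookup. The function name `group_by_cluster` is made up for illustration and is not part of Weka; the sample values are just the first seven entries of the truncated output above.

```python
from collections import defaultdict


def group_by_cluster(assignments):
    """Map each cluster number to the list of instance indices assigned to it."""
    groups = defaultdict(list)
    for instance_index, cluster_num in enumerate(assignments):
        groups[cluster_num].append(instance_index)
    return dict(groups)


# First seven values of the truncated assignments array shown above.
sample = [14, 16, 0, 0, 0, 0, 16]
groups = group_by_cluster(sample)
print(groups[16])  # instance indices that landed in cluster 16
```

Since it only iterates over the values, the same function should accept the array returned by getAssignments() as well as an ordinary list.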

My question is: Is there something similar for XMeans? I've gone through the entire XMeans API and don't see anything like it.


Solution

  • Here's a reply to my question from the Weka listserv:

     "Not as such. But all clusterers have a clusterInstance() method. You can 
     pass each training instance through the trained clustering model to 
     obtain the cluster index for each."
    

    Here's my Jython implementation of this suggestion:

     >>> from java.io import BufferedReader, FileReader
     >>> from weka.core import Instances
     >>> from weka.clusterers import XMeans
     >>> reader = BufferedReader(FileReader("some arff file"))
     >>> data = Instances(reader)  # load the dataset from the ARFF file
     >>> reader.close()
     >>> xmeans = XMeans()
     >>> xmeans.setMaxNumClusters(100)
     >>> xmeans.setMinNumClusters(2)
     >>> xmeans.buildClusterer(data)  # here's our model
     >>> enumerated_instances = data.enumerateInstances()  # a java.util.Enumeration over the instances
     >>> for index, instance in enumerate(enumerated_instances):  # enumerate() supplies the index
     ...     cluster_num = xmeans.clusterInstance(instance)  # pass each instance through the model
     ...     print "instance #", index, "is in cluster ", cluster_num  # pretty print results
     ...
    
     instance # 0 is in cluster  1
     instance # 1 is in cluster  1
     instance # 2 is in cluster  0
     instance # 3 is in cluster  0
    

    I'm leaving all of this up as a reference, since the same approach can be used to get cluster assignments from any of Weka's clusterers.
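For anyone who wants a getAssignments()-style list rather than printed lines, the loop above can be wrapped in a small helper. This is only a sketch: `get_assignments`, `FakeInstances` and `FakeClusterer` are names invented here, and the two fake classes exist purely so the snippet runs without Weka installed. The helper relies only on clusterInstance(), numInstances() and instance(i), so under Jython it should accept a trained Weka clusterer and its Instances object as they are.

```python
def get_assignments(clusterer, data):
    """Return a list whose i-th entry is the cluster index of instance i."""
    return [clusterer.clusterInstance(data.instance(i))
            for i in range(data.numInstances())]


# Hypothetical stand-ins so the sketch runs without Weka; not part of any library.
class FakeInstances(object):
    def __init__(self, rows):
        self._rows = rows

    def numInstances(self):
        return len(self._rows)

    def instance(self, i):
        return self._rows[i]


class FakeClusterer(object):
    """Puts values below a threshold in cluster 0 and the rest in cluster 1."""

    def __init__(self, threshold):
        self._threshold = threshold

    def clusterInstance(self, value):
        return 0 if value < self._threshold else 1


data = FakeInstances([9.0, 8.5, 1.2, 0.7])
assignments = get_assignments(FakeClusterer(5.0), data)
print(assignments)  # [1, 1, 0, 0]
```

With the real objects from the session above, the call would simply be get_assignments(xmeans, data).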