Tags: java, c#, weka, graph-theory, ikvm

WEKA Hierarchical Clustering Output - Leaf identification ambiguity


When calling hierarchical clustering from WEKA (I am using IKVM from C#, but I don't believe that matters; an answer in either language is fine), there is an option to generate the dendrogram in Newick format. When parsing it, however, I need to identify the leaves and link each leaf to one datum (vector) in the input.

For example, the input arff is:

@RELATION points


@ATTRIBUTE x REAL
@ATTRIBUTE y REAL

@DATA
1.0,2.0
3.0,1.0
1.0,3.0
2.0,1.0

I would get the following dendrogram in Newick format:

((2.0:1,3.0:1):1.49661,(1.0:1,1.0:1):1.49661)

Here it is not clear how the points are identified: the first branch has leaves 2.0 and 3.0, but the second branch has leaves 1.0 and 1.0, and there is no way to tell which one is which.
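
The ambiguity can be checked mechanically. The following is a minimal sketch, not part of WEKA: `NewickLeaves`, `leafLabels`, and `duplicateLabels` are hypothetical names, and the regular expression assumes the simple Newick shape shown above (unquoted leaf labels, unlabeled internal nodes).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: list the leaf labels of a Newick string and flag repeated ones. */
final class NewickLeaves {

    // A leaf label starts right after '(' or ',' and runs until the next
    // structural character. Branch lengths follow ':' and are never matched.
    private static final Pattern LEAF =
            Pattern.compile("[(,]\\s*([^(),:;\\s][^(),:;]*)");

    /** Returns the leaf labels in the order they appear in the string. */
    static List<String> leafLabels(String newick) {
        List<String> labels = new ArrayList<>();
        Matcher m = LEAF.matcher(newick);
        while (m.find()) {
            labels.add(m.group(1).trim());
        }
        return labels;
    }

    /** Returns every label that occurs more than once. */
    static Set<String> duplicateLabels(List<String> labels) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String label : labels) {
            if (!seen.add(label)) {
                duplicates.add(label);
            }
        }
        return duplicates;
    }
}
```

For the dendrogram above, `leafLabels` returns `[2.0, 3.0, 1.0, 1.0]` and `duplicateLabels` reports `1.0`, so the two leaves in the second branch cannot be told apart by label alone.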

Is there a way to change how this output is represented, or to add an extra unique attribute that identifies each datum in the Newick output?


Solution

  • Found the solution. It might not work with all distance functions, but it works with the default configuration of WEKA's hierarchical clustering: add an extra string attribute at the end of each row. It seems to be ignored in all calculations, and it can hold a unique identifier for the row (vector), which WEKA then uses to label the leaves in the final graph (the Newick dendrogram).

    Example ARFF:

    @RELATION points
    
    @ATTRIBUTE x REAL
    @ATTRIBUTE y REAL
    @ATTRIBUTE id   STRING
    
    @DATA
    1,5,100 
    2,6,200
    3,5,300
    

    This will result in the following Newick (the output comes from a run with a fourth row, id 400, which is not listed in the ARFF above):

    (((100:1.41421,200:1.41421):-0.05358,300:1.36064):0.441,400:1.80164)
    

    Without the id attribute, the clusters are exactly the same, but the leaves are labeled differently:

    (((5.0:1.41421,6.0:1.41421):-0.05358,5.0:1.36064):0.441,6.0:1.80164)
    

    This is ambiguous, because the labels 5.0 and 6.0 each appear twice.
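
The id column can also be generated instead of typed by hand. The sketch below is plain text processing, not a WEKA API: `ArffIds`, `withIdColumn`, and `rowIndex` are hypothetical names, the `r0`, `r1`, ... label scheme is an arbitrary choice, and the code assumes dense (non-sparse) data rows with no quoted values containing line breaks.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: append a unique string "id" column to dense ARFF text. */
final class ArffIds {

    /** Declares a trailing string attribute and tags each data row with r0, r1, ... */
    static String withIdColumn(String arff) {
        List<String> out = new ArrayList<>();
        boolean inData = false;
        int row = 0;
        for (String line : arff.split("\\R")) {
            String trimmed = line.trim();
            if (!inData) {
                if (trimmed.equalsIgnoreCase("@data")) {
                    // Declare the id attribute last, right before the data section.
                    out.add("@ATTRIBUTE id STRING");
                    inData = true;
                }
                out.add(line);
            } else if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                out.add(line); // keep blank lines and comments untouched
            } else {
                out.add(trimmed + ",r" + row++); // ids follow input order
            }
        }
        return String.join("\n", out);
    }

    /** Recovers the 0-based input row from a leaf label such as "r2". */
    static int rowIndex(String leafLabel) {
        return Integer.parseInt(leafLabel.substring(1));
    }
}
```

If WEKA labels the leaves with these ids, as it does with the hand-written ids above, each leaf in the Newick output maps back to its input vector through `rowIndex`. Whether the string attribute is really ignored should still be verified for any non-default distance function.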