WEKA Hierarchical Clustering Output - Leaf identification ambiguity

When running hierarchical clustering in WEKA (I am calling it from C# through IKVM, but I don't think that matters; an answer in either C# or Java is fine), there is an option to output the dendrogram in Newick format. To parse that output, I need to identify the leaves and link each leaf to one datum (vector) in the input.

For example, given this input ARFF file:

@RELATION points


@ATTRIBUTE x REAL
@ATTRIBUTE y REAL

@DATA
1.0,2.0
3.0,1.0
1.0,3.0
2.0,1.0

I would get the following dendrogram in Newick format:

((2.0:1,3.0:1):1.49661,(1.0:1,1.0:1):1.49661)

Here it is not clear how the points are identified: the first branch has leaves 2.0 and 3.0, but the second branch has leaves 1.0 and 1.0, and there is no way to tell which input row each of them refers to.

Is there a way to change how this output is represented, or to add an extra unique attribute that identifies each datum in the Newick output?

Solution

I found a solution. It might not work with all distance functions, but it works with the default configuration of WEKA's hierarchical clustering: add an extra string attribute at the end. It seems to be ignored in all distance calculations, so it can hold a unique identifier for each row (vector), and WEKA uses it to label the leaves in the output graph (the Newick dendrogram).

Example ARFF:

@RELATION points

@ATTRIBUTE x REAL
@ATTRIBUTE y REAL
@ATTRIBUTE id STRING

@DATA
1,5,100
2,6,200
3,5,300

This will result in the following Newick:

(((100:1.41421,200:1.41421):-0.05358,300:1.36064):0.441,400:1.80164)
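Once the leaves carry unique ids, linking them back to the input comes down to pulling the labels out of the Newick string. The following is a minimal Python sketch of that step, not part of WEKA: the helper names `newick_leaves`, `duplicate_labels` and `rows_by_id` are made up for illustration, and it assumes unquoted labels without spaces, commas, colons or parentheses, and dense comma-separated data rows with the id in the last field.

```python
import re
from collections import Counter


def newick_leaves(newick):
    """Return the leaf labels of a Newick string, in order of appearance."""
    # A leaf label directly follows '(' or ',' and ends at the ':' that
    # introduces its branch length. Internal nodes follow ')' and are skipped.
    return re.findall(r"[(,]\s*([^(),:;\s]+)\s*:", newick)


def duplicate_labels(labels):
    """Return the labels that occur more than once (the ambiguous ones)."""
    return [label for label, count in Counter(labels).items() if count > 1]


def rows_by_id(data_lines, id_column=-1):
    """Map each row's id field to its 0-based position in the @DATA section."""
    index = {}
    for position, line in enumerate(data_lines):
        fields = [field.strip() for field in line.split(",")]
        index[fields[id_column]] = position
    return index


# Dendrogram from the question: the label 1.0 appears twice.
ambiguous = "((2.0:1,3.0:1):1.49661,(1.0:1,1.0:1):1.49661)"
print(duplicate_labels(newick_leaves(ambiguous)))  # ['1.0']

# Dendrogram with the id attribute: every label is unique.
labelled = "(((100:1.41421,200:1.41421):-0.05358,300:1.36064):0.441,400:1.80164)"
print(newick_leaves(labelled))  # ['100', '200', '300', '400']

# Look up which data row a leaf refers to.
lookup = rows_by_id(["1,5,100", "2,6,200", "3,5,300"])
print(lookup["200"])  # 1
```

The regular expression only recovers leaf labels; if the tree structure is needed as well, a proper Newick parser is the safer choice.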

Without the id attribute, the same data produces exactly the same clusters, but the leaves are named after the values of the last attribute (y) instead:

(((5.0:1.41421,6.0:1.41421):-0.05358,5.0:1.36064):0.441,6.0:1.80164)

This is ambiguous, because the labels 5.0 and 6.0 each appear twice.
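For larger files, adding the id column by hand is tedious. Below is a rough Python sketch that appends a string attribute with a unique value per row to existing ARFF text. The function name `add_id_attribute` and the `r0`, `r1`, ... id scheme are my own choices for illustration. It assumes a dense (non-sparse) ARFF file with `@DATA` on its own line and no trailing comments on data rows.

```python
def add_id_attribute(arff_text, name="id"):
    """Append a STRING attribute holding a unique id to every data row."""
    out = []
    in_data = False
    row = 0
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            if stripped.lower() == "@data":
                # Declare the new attribute last, just before the data section.
                out.append(f"@ATTRIBUTE {name} STRING")
                in_data = True
            out.append(line)
        elif stripped and not stripped.startswith("%"):
            # Data row: append an id derived from the 0-based row position.
            out.append(f"{stripped},r{row}")
            row += 1
        else:
            # Blank lines and comment lines are passed through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


source = """@RELATION points

@ATTRIBUTE x REAL
@ATTRIBUTE y REAL

@DATA
1.0,2.0
3.0,1.0
1.0,3.0
2.0,1.0
"""
print(add_id_attribute(source))
```

Using the row position as the id means a leaf label such as `r2` points straight at the third data row, so no separate lookup table is needed.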