java cluster-analysis data-mining elki

Evaluation of precomputed clustering using ELKI in Java


I have already computed clusters and want to use the ELKI library only to evaluate this clustering.

My data has this form (two coordinates, the computed cluster label, and the true cluster label):

0.234 0.923 cluster_1 true_cluster1
0.543 0.874 cluster_2 true_cluster3
...

I tried the following:

  1. Create two databases, one with the result labels and one with the reference labels:

    double [][] data;
    String [] reference_labels, result_labels;
    
    DatabaseConnection dbc1 = new ArrayAdapterDatabaseConnection(data, result_labels);
    Database db1 = new StaticArrayDatabase(dbc1, null);
    
    DatabaseConnection dbc2 = new ArrayAdapterDatabaseConnection(data, reference_labels);
    Database db2 = new StaticArrayDatabase(dbc2, null);
    
  2. Run ByLabelClustering on each database:

    Clustering<Model> clustering1 = new ByLabelClustering().run(db1);
    Clustering<Model> clustering2 = new ByLabelClustering().run(db2);
    
  3. Use ClusterContingencyTable to compare the clusterings and get the measures:

    ClusterContingencyTable ct = new ClusterContingencyTable(true, false);
    ct.process(clustering1, clustering2);
    PairCounting paircount = ct.getPaircount();
    

The problem is that the measures are not computed.
I looked into the source code of ClusterContingencyTable and PairCounting, and it seems that this won't work if the clusterings come from different databases, while a database can have only one labels relation.
Is there a way to do this in ELKI?


Solution

  • You can easily modify the ByLabelClustering class (or implement your own) to use only the first label or only the second label; then a single database is enough.

    Alternatively, use the three-parameter constructor, whose third argument sets the starting DBID:

    DatabaseConnection dbc1 = new ArrayAdapterDatabaseConnection(data, result_labels, 0);
    Database db1 = new StaticArrayDatabase(dbc1, null);
    
    DatabaseConnection dbc2 = new ArrayAdapterDatabaseConnection(data, reference_labels, 0);
    Database db2 = new StaticArrayDatabase(dbc2, null);
    

    so that the DBIDs are the same in both databases. Then ClusterContingencyTable should work.

    By default, ELKI continues enumerating objects across databases, so the first database would have IDs 1..n and the second n+1..2n. To compare two clusterings, however, they need to contain the same objects, not disjoint sets.
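
To sanity-check whatever ELKI reports, the pair-counting idea can also be applied directly to the two label arrays. The following is a minimal plain-Java sketch that does not use ELKI at all; the class `PairCountingSketch` and its method names are made up for illustration. It counts unordered pairs of distinct objects only, so ELKI's own numbers (which may include self-pairs or treat noise clusters specially, depending on the `ClusterContingencyTable` settings) need not match exactly.

```java
/** Pair counting between two labelings of the same objects (plain Java, no ELKI). */
final class PairCountingSketch {
  private PairCountingSketch() {}

  /**
   * Counts unordered pairs (i < j) of distinct objects.
   * Returns {inBoth, inResultOnly, inReferenceOnly, inNeither}, where "in" means
   * the two objects of the pair carry the same label in that labeling.
   */
  static long[] countPairs(String[] result, String[] reference) {
    if (result.length != reference.length) {
      throw new IllegalArgumentException("Label arrays must describe the same objects.");
    }
    long both = 0, resultOnly = 0, referenceOnly = 0, neither = 0;
    for (int i = 0; i < result.length; i++) {
      for (int j = i + 1; j < result.length; j++) {
        boolean sameResult = result[i].equals(result[j]);
        boolean sameReference = reference[i].equals(reference[j]);
        if (sameResult && sameReference) {
          both++;
        } else if (sameResult) {
          resultOnly++;
        } else if (sameReference) {
          referenceOnly++;
        } else {
          neither++;
        }
      }
    }
    return new long[] { both, resultOnly, referenceOnly, neither };
  }

  /** Fraction of pairs joined in the result that are also joined in the reference. */
  static double precision(long[] c) {
    return c[0] / (double) (c[0] + c[1]); // NaN if the result joins no pairs
  }

  /** Fraction of pairs joined in the reference that are also joined in the result. */
  static double recall(long[] c) {
    return c[0] / (double) (c[0] + c[2]); // NaN if the reference joins no pairs
  }

  /** Harmonic mean of pair precision and pair recall. */
  static double f1(long[] c) {
    return 2.0 * c[0] / (2.0 * c[0] + c[1] + c[2]);
  }

  /** Fraction of all pairs on which both labelings agree (joined or separated). */
  static double randIndex(long[] c) {
    return (c[0] + c[3]) / (double) (c[0] + c[1] + c[2] + c[3]);
  }
}
```

With the arrays from the question this would be called as `PairCountingSketch.countPairs(result_labels, reference_labels)`. The double loop is quadratic in the number of objects, which is fine for a quick check but slower than working from a contingency table on large data sets.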