java cluster-analysis dbscan apache-commons-math

Apache DBSCANClusterer always returns 0 clusters


I'm trying to use DBSCANClusterer from org.apache.commons.math3.ml.clustering. The cluster method returns a list of clusters, but for me the size of that list is always 0. What am I doing wrong? Below is my test code:

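import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.DBSCANClusterer;
import org.apache.commons.math3.ml.clustering.DoublePoint;

public class ClusterTest {

    public static void main(String[] args) throws IOException {
        // eps = 0.05 (neighbourhood radius), minPts = 15
        DBSCANClusterer<DoublePoint> dbscan = new DBSCANClusterer<DoublePoint>(.05, 15);
        List<DoublePoint> points = getData();
        List<Cluster<DoublePoint>> clusters = dbscan.cluster(points);
        for (Cluster<DoublePoint> c : clusters) {
            System.out.println(c.getPoints());
        }
    }

    // Reads tab-separated X/Y pairs from clust.txt, one point per line.
    private static List<DoublePoint> getData() throws IOException {
        List<DoublePoint> data = new ArrayList<DoublePoint>();
        try (BufferedReader reader = new BufferedReader(new FileReader("clust.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                try {
                    String[] l = line.split("\t");
                    // Fresh array per point so the points do not all share one array.
                    double[] d = { Double.parseDouble(l[0]), Double.parseDouble(l[1]) };
                    data.add(new DoublePoint(d));
                } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
                    // Report unparsable lines instead of silently dropping them.
                    System.err.println("Skipping line: " + line);
                }
            }
        }
        return data;
    }
}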

The file clust.txt contains two columns with X and Y values, separated by a tab. I tried several different data sets and I always get 0 clusters.


Solution

  • Try the DBSCAN implementation in ELKI instead. Unfortunately, Apache Commons Math is not very good for cluster analysis. I moved away from commons-math because of various small issues, and ELKI works much better for me.

    From a quick look, commons-math is still largely unmaintained when it comes to cluster analysis: the clustering code was last touched for MATH-917. The DBSCAN code there is still quite inefficient, and in the previous version it relied entirely on deprecated classes. It has received only about 4 commits over the years.

    If you don't get any clusters, your epsilon is probably too small and your minPts too high. The commons-math implementation of DBSCAN also discards all noise objects instead of returning them, which is probably what you are seeing: every point is classified as noise, so the returned list is empty.
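
One way to check the "all noise" explanation is to count, by brute force, how many points have at least minPts neighbours within epsilon. If that count is 0, no point can start a cluster. The sketch below uses only the JDK. The class NeighborCheck, its methods, and the sample grid are made up for illustration and are not part of commons-math or ELKI. It also assumes a point does not count as its own neighbour, which differs between implementations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library: brute-force eps-neighbour counting.
class NeighborCheck {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Number of other points within eps of points.get(index). */
    static int countNeighbors(List<double[]> points, int index, double eps) {
        int count = 0;
        for (int j = 0; j < points.size(); j++) {
            // Assumption: the point itself is not counted as a neighbour.
            if (j != index && distance(points.get(index), points.get(j)) <= eps) {
                count++;
            }
        }
        return count;
    }

    /** Number of points that have at least minPts neighbours within eps. */
    static int countCorePoints(List<double[]> points, double eps, int minPts) {
        int core = 0;
        for (int i = 0; i < points.size(); i++) {
            if (countNeighbors(points, i, eps) >= minPts) {
                core++;
            }
        }
        return core;
    }

    public static void main(String[] args) {
        // Made-up sample data: a 4x4 grid with spacing 1.0.
        List<double[]> points = new ArrayList<>();
        for (int x = 0; x < 4; x++) {
            for (int y = 0; y < 4; y++) {
                points.add(new double[] {x, y});
            }
        }
        // eps far below the point spacing: nobody has neighbours, everything is noise.
        System.out.println(countCorePoints(points, 0.05, 15)); // prints 0
        // eps covering the 8 surrounding grid cells, with a modest minPts.
        System.out.println(countCorePoints(points, 1.5, 3));   // prints 16
    }
}
```

Running the same count on the points loaded from clust.txt with eps = 0.05 and minPts = 15 would show whether any point is dense enough to form a cluster. If the result is 0, increasing epsilon or lowering minPts is the first thing to try.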