I am trying to cluster some geospatial data, and I previously tried the WEKA library. I found this benchmark and decided to try ELKI.
Despite the advice not to use ELKI as a Java library (it is supposedly less maintained than the UI), I incorporated it into my application, and I can say that I am quite happy with the results. The structures it uses to store data are far more efficient than the ones used by Weka, and the fact that it has the option of using a spatial index is definitely a plus.
However, when I compare the results of Weka's DBSCAN with the ones from ELKI's DBSCAN, I am a little puzzled. I could accept that different implementations give slightly different results, but this magnitude of difference makes me think there is something wrong with the algorithm (probably with my code). The number of clusters and their geometry is very different in the two algorithms.
For the record, I am using the latest version of ELKI (0.6.0), and the parameters I used for my simulations were:
minpts=50 epsilon=0.008
I coded two DBSCAN functions (one for Weka, one for ELKI), where the "entry point" is a CSV with points, and the "output" for both of them is also identical: a function that calculates the concave hull of a set of points (one per cluster). Since the function that reads the CSV file into an ELKI "database" is relatively simple, I think my problem could be:
a) in the parametrization of the algorithm; b) reading the results (most likely).
Parametrizing DBSCAN does not pose any challenges, and I use the two compulsory parameters, which I previously tested through the UI:
ListParameterization params2 = new ListParameterization();
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPoints);
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, epsilon);
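For context, here is a simplified sketch of the surrounding setup, following the usual ELKI-from-Java pattern (the file name is just an example, and the class names are the ones I believe are correct for 0.6.0):
import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN;
import de.lmu.ifi.dbs.elki.data.Clustering;
import de.lmu.ifi.dbs.elki.data.model.Model;
import de.lmu.ifi.dbs.elki.database.Database;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection;
import de.lmu.ifi.dbs.elki.utilities.ClassGenericsUtil;
import de.lmu.ifi.dbs.elki.utilities.optionhandling.parameterization.ListParameterization;

// Load the CSV into a static database (file name is an example).
ListParameterization dbParams = new ListParameterization();
dbParams.addParameter(FileBasedDatabaseConnection.Parameterizer.INPUT_ID, "points.csv");
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbParams);
db.initialize();

// Instantiate DBSCAN from the parameterization above and run it on the database.
DBSCAN<?, ?> dbscan = ClassGenericsUtil.parameterizeOrAbort(DBSCAN.class, params2);
Clustering<Model> result = dbscan.run(db);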
Reading the result is a bit more challenging, as I don't completely understand the organization of the structure that stores the clusters. My idea is to iterate over each cluster, get the list of points, and pass it to the function that calculates the concave hull, in order to generate a polygon.
ArrayList<Clustering<?>> cs = ResultUtil.filterResults(result, Clustering.class);
for (Clustering<?> c : cs) {
  System.out.println("clusters: " + c.getAllClusters().size());
  for (de.lmu.ifi.dbs.elki.data.Cluster<?> cluster : c.getAllClusters()) {
    if (!cluster.isNoise()) {
      Coordinate[] ptList = new Coordinate[cluster.size()];
      int ct = 0;
      for (DBIDIter iter = cluster.getIDs().iter(); iter.valid(); iter.advance()) {
        ptList[ct] = dataMap.get(DBIDUtil.toString(iter));
        ++ct;
      }
      // there are no "empty" clusters
      assertTrue(ptList.length > 0);
      GeoPolygon poly = getBoundaryFromCoordinates(ptList);
      // compare strings with equals(), not ==
      if ("Polygon".equals(poly.getCoordinates().getGeometryType())) {
        try {
          out.write(poly.getCoordinates().toText() + "\n");
        } catch (IOException e) {
          e.printStackTrace();
        }
      } else {
        System.out.println(poly.getCoordinates().getGeometryType());
      }
    } // !noise
  }
}
I noticed that the "noise" was coming up as a cluster, so I ignore that cluster (I don't want to draw it). I am not sure if this is the right way of reading the clusters, as I haven't found many examples. I also have some questions for which I have not found answers yet.
Any comments that could point me in the right direction, or any code suggestions for iterating over the result set of ELKI's DBSCAN, would be really welcome! I also used ELKI's OPTICSxi in my code, and I have even more questions regarding those results, but I guess I'll save those for another post.
Accessing the DBIDs of ELKI works if you pay attention to how they are assigned. For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. But if you always restart your process, the DBIDs will be assigned deterministically anyway (only when using the MiniGUI will they differ if you rerun a job!). This will also be more efficient than DBIDUtil.toString().
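For example, inside your per-cluster loop (a sketch, assuming a static database so the cast to DBIDRange holds; Relation, DBIDRange, and TypeUtil are in de.lmu.ifi.dbs.elki.database.relation, de.lmu.ifi.dbs.elki.database.ids, and de.lmu.ifi.dbs.elki.data.type; the coords array is a placeholder for your own data structure):
// Get the relation and its ID range (a static database assigns IDs as one contiguous range).
Relation<DoubleVector> rel = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
DBIDRange ids = (DBIDRange) rel.getDBIDs();
for (DBIDIter iter = cluster.getIDs().iter(); iter.valid(); iter.advance()) {
  // Offset of this object in the database = row number in your input file.
  int offset = ids.getOffset(iter);
  Coordinate coord = coords[offset]; // hypothetical array parallel to the CSV rows
}
This avoids building a string-keyed map entirely.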
DBSCAN results are not hierarchical, so every cluster should be a top-level cluster (getAllClusters() and getToplevelClusters() should return the same clusters here).
As for Weka, it sometimes does automatic normalization, and then the epsilon value will be distorted. For geographic data, I would prefer geodetic distance anyway; Euclidean distance on latitude and longitude does not make sense.
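If you want to try that from the Java API, something like this should select a geodetic distance (a sketch: DistanceBasedAlgorithm.DISTANCE_FUNCTION_ID and LngLatDistanceFunction are the 0.6.0 names as far as I remember; note that epsilon is then in the unit of that distance function, not in degrees):
// Use geodetic (great-circle) distance instead of the default Euclidean distance.
// Assumes vectors are (longitude, latitude); use LatLngDistanceFunction for the other order.
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.DistanceBasedAlgorithm.DISTANCE_FUNCTION_ID,
    de.lmu.ifi.dbs.elki.distance.distancefunction.geo.LngLatDistanceFunction.class);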
Check this part of Weka's code: the "norm" function, used by EuclideanDataObject. This looks to me as if Weka's DBSCAN enforces a normalization on the data set! Try scaling your data to [0:1] (I'm pretty sure there is a filter for this in ELKI), and check whether the results are identical afterwards.
Judging from this code snippet, I would blame Weka. The code above also looks a bit inefficient to me. The filter approach makes more sense, IMHO, than this normalization enforced in the data objects.
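To test this on the ELKI side, you can attach the min-max normalization filter when loading your data; a sketch (AttributeWiseMinMaxNormalization lives in de.lmu.ifi.dbs.elki.datasource.filter.normalization in 0.6.0; the file name is just a placeholder):
// Rescale every attribute to [0:1] while reading the CSV, mimicking Weka's normalization.
ListParameterization dbParams = new ListParameterization();
dbParams.addParameter(FileBasedDatabaseConnection.Parameterizer.INPUT_ID, "points.csv");
dbParams.addParameter(AbstractDatabaseConnection.Parameterizer.FILTERS_ID,
    AttributeWiseMinMaxNormalization.class);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbParams);
db.initialize();
If DBSCAN on the normalized data then matches Weka with the same epsilon, the normalization was the culprit.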