I am using ELKI to mine some geospatial data (lat,long pairs) and I am quite concerned on using the right data types and algorithms. On the parameterizer of my algorithm, I tried to change the default distance function by a geo function (LngLatDistanceFunction, as I am using x,y data) as bellow:
params.addParameter (DISTANCE_FUNCTION_ID, geo.LngLatDistanceFunction.class);
However the results are quite surprising: it creates clusters of a repeated point, such as the example bellow:
(2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN)]
This is an image of this example.
Whether I used a non-geo distance (for instance manhattan):
params.addParameter (DISTANCE_FUNCTION_ID, geo.minkowski.ManhattanDistanceFunction.class);
,the output is much more reasonable
I wonder if there is something wrong with my code.
I am running the algorithm directly on the db, like this:
Clustering<Model> result = dbscan.run(db);
And then iterating over the results in a loop, while I construct the convex hulls:
for (de.lmu.ifi.dbs.elki.data.Cluster<?> cl : result.getAllClusters()) {
if (!cl.isNoise()){
Coordinate[] ptList=new Coordinate[cl.size()];
int ct=0;
for (DBIDIter iter = cl.getIDs().iter();
iter.valid(); iter.advance()) {
ptList[ct]=dataMap.get(DBIDUtil.toString(iter));
++ct;
}
GeoPolygon poly=getBoundaryFromCoordinates(ptList);
if (poly.getCoordinates().getGeometryType()==
"Polygon"){
out.write(poly.coordinates.toText()+"\n");
}
}
}
To map each ID to a point, I use a hashmap, that I initialized when reading the database. The reason why I am adding this code, is because I suspect that I may doing something wrong regarding the structures that I am passing/reading to/from the algorithm. I thank you in advance for any comments that could help me to solve this. I find ELKI a very efficient and sophisticated library, but I have trouble to find examples that illustrate simple cases, like mine.
What is your epsilon
value?
Geographic distance is in meters in ELKI (if I recall correctly); Manhattan distance would be in latitude + longitude degrees. For obvious reasons, these live on very different scales, and therefore you need to choose a different epsilon value.
In your previous questions, you used epsilon=0.008
. For geodetic distance, 0.008 meters = 8 millimeter.
At epsilon = 8 millimeter, I am not surprised if the clusters you get consist only of duplicated coordinates. Any chance that above coordinates do exist multiple times in your data set?