Search code examples
cluster-analysisdata-mininggeospatialdbscanelki

Using a Geo Distance Function on ELKI


I am using ELKI to mine some geospatial data (lat,long pairs) and I am quite concerned on using the right data types and algorithms. On the parameterizer of my algorithm, I tried to change the default distance function by a geo function (LngLatDistanceFunction, as I am using x,y data) as bellow:

params.addParameter (DISTANCE_FUNCTION_ID,  geo.LngLatDistanceFunction.class);

However the results are quite surprising: it creates clusters of a repeated point, such as the example bellow:

(2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN)]

This is an image of this example.

Whether I used a non-geo distance (for instance manhattan):

params.addParameter (DISTANCE_FUNCTION_ID,  geo.minkowski.ManhattanDistanceFunction.class);

,the output is much more reasonable

I wonder if there is something wrong with my code.

I am running the algorithm directly on the db, like this:

         Clustering<Model> result = dbscan.run(db); 

And then iterating over the results in a loop, while I construct the convex hulls:

   for (de.lmu.ifi.dbs.elki.data.Cluster<?> cl : result.getAllClusters()) {
               if (!cl.isNoise()){
                     Coordinate[] ptList=new Coordinate[cl.size()];
                        int ct=0;               

                        for (DBIDIter iter = cl.getIDs().iter(); 
                                iter.valid(); iter.advance()) {
                                ptList[ct]=dataMap.get(DBIDUtil.toString(iter));                                                                                                                                            
                                ++ct;                                                                   
                        }       

                        GeoPolygon poly=getBoundaryFromCoordinates(ptList);
                        if (poly.getCoordinates().getGeometryType()==
                        "Polygon"){                                                     
                            out.write(poly.coordinates.toText()+"\n");
                        }                      
               }
            }            

To map each ID to a point, I use a hashmap, that I initialized when reading the database. The reason why I am adding this code, is because I suspect that I may doing something wrong regarding the structures that I am passing/reading to/from the algorithm. I thank you in advance for any comments that could help me to solve this. I find ELKI a very efficient and sophisticated library, but I have trouble to find examples that illustrate simple cases, like mine.


Solution

  • What is your epsilon value?

    Geographic distance is in meters in ELKI (if I recall correctly); Manhattan distance would be in latitude + longitude degrees. For obvious reasons, these live on very different scales, and therefore you need to choose a different epsilon value.

    In your previous questions, you used epsilon=0.008. For geodetic distance, 0.008 meters = 8 millimeter.

    At epsilon = 8 millimeter, I am not surprised if the clusters you get consist only of duplicated coordinates. Any chance that above coordinates do exist multiple times in your data set?