ELKI for OPTICS Xi - Can I make it go faster?


I am new to ELKI, and I've successfully tuned the algorithm I'd like to run. I used it on 3K coordinates and it was very fast, so now I'm trying to scale up to around 1 million records. Right now I'm running it on 30K records, but it has been hours and it's still running.

Is there any way I can boost performance? I noticed java.exe *32 is only using ~13% CPU and 150KB memory (the machine is a 2.8 GHz i7 with 32 GB RAM).

I used pagesize 1024 based on someone else's earlier suggestion for working with only 2 dimensions (lon/lat).

Running directly from Windows command line:

java -jar <path> cli 
-algorithm clustering.optics.OPTICSXi
-opticsxi.xi 0.006
-optics.minpts 5
-dbc.in <path> 
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory 
-pagefile.pagesize 1024 
-spatial.bulkstrategy SortTileRecursiveBulkSplit 
-algorithm.distancefunction geo.LngLatDistanceFunction 
-geo.model WGS84SpheroidEarthModel 
-opticsxi.algorithm OPTICSHeap 
-resulthandler ResultWriter 
-out <path>

Solution

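  • The runtime of OPTICS depends on the selectivity of its range queries, that is, on how many neighbors each query of radius epsilon returns.

    With an infinite radius, every query returns all n points, so the overall performance will be O(n^2). Your command does not set -optics.epsilon, which, as far as I know, leaves the radius unbounded.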

    Try to choose -optics.epsilon as small as your application permits. With an index, the smaller the value, the faster OPTICS will run. However, if you use too small a value (say, 1 meter), you may lose the large-scale structure of your data. Geographic data can contain distances of up to 20,000,000 meters, but in many applications points on other continents matter little, and a radius of 10,000 m or 100,000 m yields a substantial speedup.

    If your data is noisy, you may want to increase minPts to e.g. 10 or 20 for your largest data set.
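
To get a feel for how much the radius changes the work per query, here is a minimal sketch in plain Java. It does not use ELKI: the class and method names are hypothetical, the distance is a spherical haversine approximation rather than ELKI's LngLatDistanceFunction with the WGS84 spheroid, and the scan is brute force rather than an R*-tree lookup. It only counts how many neighbors all range queries would return for a few epsilon values on made-up uniform data.

```java
import java.util.Random;

public class RangeSelectivity {
    // Mean Earth radius in meters (spherical approximation, not the WGS84 spheroid).
    static final double EARTH_RADIUS_M = 6_371_000.0;

    /** Haversine distance in meters between two lon/lat points given in degrees. */
    static double haversine(double lon1, double lat1, double lon2, double lat2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /**
     * Total number of neighbors returned if every point issues one range query
     * of radius epsilon (each point counts itself). Points are {lon, lat} pairs.
     */
    static long totalNeighbors(double[][] pts, double epsilon) {
        long total = 0;
        for (double[] q : pts) {
            for (double[] p : pts) {
                if (haversine(q[0], q[1], p[0], p[1]) <= epsilon) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 2000;
        double[][] pts = new double[n][];
        for (int i = 0; i < n; i++) {
            // Made-up data: uniform random lon in [-180, 180), lat in [-90, 90).
            pts[i] = new double[] { rnd.nextDouble() * 360 - 180, rnd.nextDouble() * 180 - 90 };
        }
        double[] radii = { 10_000, 100_000, Double.POSITIVE_INFINITY };
        for (double eps : radii) {
            System.out.printf("epsilon=%s m: %d neighbors in total (n^2 = %d)%n",
                    eps, totalNeighbors(pts, eps), (long) n * n);
        }
    }
}
```

With an infinite radius the total is always n^2, which is the work OPTICS has to process no matter how good the index is. With a small radius the result sets shrink, and that is what gives the index something to prune.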