Search code examples
mongodbcluster-analysisdbscan

Best programming language to implement DBSCAN algorithm querying a MongoDB database?


I've to implement the DBSCAN algorithm. Assuming to start from this pseudocode

DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C

regionQuery(P, eps)
   return all points within P's eps-neighborhood

My code has to run on an Amazon EC2 Instance with Ubuntu Linux 64 bit.

The function regionQuery queries a MongoDB database to obtain all points within P's eps-neighborhood.

So, according to you, what is the best programming language to implement it to improve performances? C, PHP, Java (I don't think)?


Solution

  • I assume that you have a lot of points and need results fast - otherwise you can use almost anything.

    It seems like map-reduce job for me

    Map part would be loop "for each unvisited point" and should emit data construct containing neighbors, candidate clusters and whatever else. In case point is classified as noise it should emit nothing.

    Cluster expansion shall go into reduce and possibly finalize part - also language choice would be javascript and everything would happen inside mongo