I've to implement the DBSCAN algorithm. Assuming to start from this pseudocode
DBSCAN(D, eps, MinPts)
C = 0
for each unvisited point P in dataset D
mark P as visited
NeighborPts = regionQuery(P, eps)
if sizeof(NeighborPts) < MinPts
mark P as NOISE
else
C = next cluster
expandCluster(P, NeighborPts, C, eps, MinPts)
expandCluster(P, NeighborPts, C, eps, MinPts)
add P to cluster C
for each point P' in NeighborPts
if P' is not visited
mark P' as visited
NeighborPts' = regionQuery(P', eps)
if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
if P' is not yet member of any cluster
add P' to cluster C
regionQuery(P, eps)
return all points within P's eps-neighborhood
My code has to run on an Amazon EC2 Instance with Ubuntu Linux 64 bit.
The function regionQuery queries a MongoDB database to obtain all points within P's eps-neighborhood.
So, according to you, what is the best programming language to implement it to improve performances? C, PHP, Java (I don't think)?
I assume that you have a lot of points and need results fast - otherwise you can use almost anything.
It seems like map-reduce job for me
Map part would be loop "for each unvisited point" and should emit data construct containing neighbors, candidate clusters and whatever else. In case point is classified as noise it should emit nothing.
Cluster expansion shall go into reduce and possibly finalize part - also language choice would be javascript and everything would happen inside mongo