I have a dataset of points of interest (POIs) on a map, like the following:
ID latitude longitude
1 48.860294 2.338629
2 48.858093 2.294694
3 48.8581965 2.2937403
4 48.8529717 2.3477134
...
The goal is to find clusters of points that are very close to each other (pairwise distance less than 100 m). The output I expect for this dataset is:
(2, 3)
Points 2 and 3 are less than 100 m apart, while all the other points are far away from each other, so they should be ignored.
Since the dataset is huge (it contains points of interest from all over the world), I need to process it with Spark in parallel. What approach should I take for this case?
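To make the distance criterion concrete before worrying about scale, here is a minimal single-machine sketch (not the Spark solution) that computes great-circle distances with the haversine formula and brute-forces all pairs of the sample data. The 6 371 000 m Earth radius and the O(n²) pairing are only for this sanity check:

```python
from math import radians, sin, cos, asin, sqrt
from itertools import combinations

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    R = 6371000.0  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

points = {
    1: (48.860294, 2.338629),
    2: (48.858093, 2.294694),
    3: (48.8581965, 2.2937403),
    4: (48.8529717, 2.3477134),
}

# Brute-force O(n^2) pairing -- fine for a sanity check, useless at world scale.
pairs = [
    (a, b)
    for (a, pa), (b, pb) in combinations(points.items(), 2)
    if haversine_m(*pa, *pb) < 100
]
print(pairs)  # [(2, 3)]
```

Points 2 and 3 come out roughly 70 m apart, so (2, 3) is the only pair below the 100 m threshold.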
I actually solved this problem using the following two approaches:
DBSCAN algorithm implemented as Spark job with partitioning
https://github.com/irvingc/dbscan-on-spark
GeoSpark with a spatial distance join
https://github.com/DataSystemsLab/GeoSpark
Both of them are based on Spark, so they scale well to large datasets. However, I found that dbscan-on-spark consumes a lot of memory, so I ended up using GeoSpark with a distance join.
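Both libraries scale for the same underlying reason: they partition space into cells roughly the size of the search radius and only compare points in the same or neighbouring cells, so each partition can be processed independently. Here is a minimal single-machine sketch of that idea (the 0.002° cell size is a rough over-estimate of 100 m at this latitude, and the planar distance approximation is my own simplification, adequate at 100 m scales; in a real Spark job the cell key would drive the partitioning or join key):

```python
from collections import defaultdict
from math import cos, radians, sqrt

def approx_dist_m(lat1, lon1, lat2, lon2):
    """Planar approximation of the distance in metres; fine at ~100 m scales."""
    dy = (lat2 - lat1) * 111_320.0
    dx = (lon2 - lon1) * 111_320.0 * cos(radians((lat1 + lat2) / 2))
    return sqrt(dx * dx + dy * dy)

CELL = 0.002  # grid cell in degrees; larger than 100 m in both axes here

def cell_of(lat, lon):
    return (int(lat // CELL), int(lon // CELL))

points = {
    1: (48.860294, 2.338629),
    2: (48.858093, 2.294694),
    3: (48.8581965, 2.2937403),
    4: (48.8529717, 2.3477134),
}

# Bucket points by grid cell (in Spark this key would be the partition key).
grid = defaultdict(list)
for pid, (lat, lon) in points.items():
    grid[cell_of(lat, lon)].append(pid)

# Only compare points in the same or one of the 8 adjacent cells.
pairs = set()
for (cy, cx), ids in grid.items():
    candidates = [
        q
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        for q in grid.get((cy + dy, cx + dx), [])
    ]
    for p in ids:
        for q in candidates:
            if p < q and approx_dist_m(*points[p], *points[q]) < 100:
                pairs.add((p, q))

print(sorted(pairs))  # [(2, 3)]
```

The `p < q` check plus the set deduplicates pairs that are seen from both cells. This cell-bucketing is essentially what GeoSpark's spatial partitioning and dbscan-on-spark's data partitioning do under the hood, which is why distant points never need to be compared at all.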