apache-spark, computational-geometry

Spark Geolocated Points Clustering


I have a dataset of points of interest on the map, like the following:

    ID  latitude    longitude
    1   48.860294   2.338629
    2   48.858093   2.294694
    3   48.8581965  2.2937403
    4   48.8529717  2.3477134
    ...

The goal is to find the clusters of points that are very close to each other (less than 100 m apart). So the output I expect for this dataset would be:

(2, 3)

Points 2 and 3 are very close to each other, less than 100 m apart, while the others are far away and should be ignored.
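
For reference, a quick haversine check (plain Scala, using the standard 6371 km mean Earth radius) shows that points 2 and 3 are roughly 70 m apart, well under the threshold:

    object HaversineCheck {
      // Great-circle distance in metres between two (lat, lon) pairs given in degrees.
      def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
        val R = 6371000.0 // mean Earth radius in metres
        val dLat = math.toRadians(lat2 - lat1)
        val dLon = math.toRadians(lon2 - lon1)
        val a = math.pow(math.sin(dLat / 2), 2) +
          math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
            math.pow(math.sin(dLon / 2), 2)
        2 * R * math.asin(math.sqrt(a))
      }

      def main(args: Array[String]): Unit = {
        val d = haversine(48.858093, 2.294694, 48.8581965, 2.2937403) // points 2 and 3
        println(f"distance = $d%.1f m") // prints roughly 70.7 m
      }
    }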

Since the dataset is huge (it contains all the points of interest in the world), I need to do this with Spark and some parallel processing. What approach should I take for this case?


Solution

  • I actually solved this problem using the following two approaches:

    DBSCAN implemented as a Spark job with partitioning
    https://github.com/irvingc/dbscan-on-spark

    GeoSpark with a spatial distance join
    https://github.com/DataSystemsLab/GeoSpark

    Both are based on Spark, so they work well at large scale. However, I found that dbscan-on-spark consumes a lot of memory, so I ended up using GeoSpark with a distance join. Rough sketches of both are shown below.
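
    A minimal sketch of the dbscan-on-spark approach, roughly following that project's README (the exact parameter names may differ between versions; the input/output paths are placeholders, and the 0.001-degree eps is only a crude stand-in for 100 m, since eps is expressed in coordinate units):

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.clustering.dbscan.DBSCAN

        object PoiDbscan {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("PoiDbscan"))

            // CSV lines like "1,48.860294,2.338629" -> dense (lat, lon) vectors
            val points = sc.textFile("hdfs:///poi/points.csv")
              .map(_.split(","))
              .map(cols => Vectors.dense(cols(1).toDouble, cols(2).toDouble))
              .cache()

            // eps is in the same units as the coordinates (degrees here);
            // ~0.001 degrees is only roughly 100 m at these latitudes.
            val model = DBSCAN.train(
              points,
              eps = 0.001,
              minPoints = 2,
              maxPointsPerPartition = 250)

            // One "lat,lon,clusterId" line per point.
            model.labeledPoints
              .map(p => s"${p.x},${p.y},${p.cluster}")
              .saveAsTextFile("hdfs:///poi/dbscan-clusters")

            sc.stop()
          }
        }

    And a sketch of the GeoSpark distance join. The class and method names below follow the GeoSpark 1.x examples and may differ in other versions; the input file (assumed to be reshaped to "longitude,latitude,id" so the coordinates come first), the 0.001-degree radius and the output path are placeholders:

        import org.apache.spark.SparkConf
        import org.apache.spark.api.java.JavaSparkContext
        import org.apache.spark.serializer.KryoSerializer
        import org.datasyslab.geospark.enums.{FileDataSplitter, GridType}
        import org.datasyslab.geospark.serde.GeoSparkKryoRegistrator
        import org.datasyslab.geospark.spatialOperator.JoinQuery
        import org.datasyslab.geospark.spatialRDD.{CircleRDD, PointRDD}

        object PoiDistanceJoin {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("PoiDistanceJoin")
              .set("spark.serializer", classOf[KryoSerializer].getName)
              .set("spark.kryo.registrator", classOf[GeoSparkKryoRegistrator].getName)
            val sc = new JavaSparkContext(conf)

            // Points from a CSV laid out as "longitude,latitude,id"; offset 0 means the
            // coordinates start in the first column, and carryInputData keeps the id.
            val points = new PointRDD(
              sc, "hdfs:///poi/points_lon_lat.csv", 0, FileDataSplitter.CSV, true)

            // Buffer every point by the search radius. The radius is in coordinate
            // units (degrees), so 0.001 is only a rough stand-in for ~100 m; for exact
            // metre distances the data should first be reprojected to a metric CRS.
            val circles = new CircleRDD(points, 0.001)

            points.analyze()
            circles.analyze()
            circles.spatialPartitioning(GridType.QUADTREE)
            points.spatialPartitioning(circles.getPartitioner)

            // Flat distance join: one (point, nearby point) pair per match.
            val usingIndex = false
            val considerBoundaryIntersection = true
            val pairs = JoinQuery.DistanceJoinQueryFlat(
              points, circles, usingIndex, considerBoundaryIntersection)

            // Keep only the carried ids and drop trivial self-matches; symmetric
            // duplicates like (2,3)/(3,2) would still need de-duplication downstream.
            pairs.rdd
              .map { case (a, b) => (a.getUserData.toString, b.getUserData.toString) }
              .filter { case (idA, idB) => idA != idB }
              .saveAsTextFile("hdfs:///poi/close-pairs")

            sc.stop()
          }
        }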