scala, apache-spark, cluster-analysis, k-means, apache-spark-mllib

Spark MLlib K-Means Clustering


I have some geographical points defined by latitude, longitude and a score, and I want to use the MLlib K-Means algorithm to cluster them. Is that available in MLlib K-Means, and if so, how can I pass the parameters or features to the algorithm? As far as I have found, it reads a text file of doubles and builds clusters from it.
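
For reference, here is a minimal sketch of how such features could be fed to the RDD-based MLlib KMeans. The file name, column layout, k, and iteration count are all hypothetical, and the answer below explains why clustering raw latitude/longitude this way is problematic:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"))

        // Hypothetical input: one point per line as "latitude,longitude,score"
        val features = sc.textFile("points.csv").map { line =>
          val Array(lat, lon, score) = line.split(",").map(_.trim.toDouble)
          Vectors.dense(lat, lon, score)   // each point becomes one feature vector
        }.cache()

        // Train k-means with an assumed k = 5 and 20 iterations
        val model = KMeans.train(features, 5, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }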


Solution

  • Do not use k-means on latitude/longitude data

    Because of distortion. The Earth is a sphere, and longitudes of -180° and +180° are not 360° apart; they meet at the same meridian. And even if your data is well away from the date line, e.g. all of it is in San Francisco at latitude ~37.773972, a degree of longitude covers over 20% less ground than a degree of latitude, and this gets worse the further north you go (see the sketch after this answer).

    Use an algorithm such as HAC or DBSCAN that can work with haversine distance (in a good implementation; there are many bad implementations). ELKI, for example, has very fast clustering algorithms, allows different geo-distances, and even offers index acceleration, which helps a lot with geo points.

    See also this blog post: https://doublebyteblog.wordpress.com/2014/05/16/clustering-geospatial-data/
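
    To make the distortion concrete, here is a small, self-contained sketch (only the San Francisco latitude above is taken from this answer; the rest is plain haversine math) comparing how far one degree of longitude reaches at the equator versus at that latitude:

        object GeoDistortionDemo {
          // Great-circle (haversine) distance in kilometres between two lat/lon points
          def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
            val earthRadiusKm = 6371.0
            val dLat = math.toRadians(lat2 - lat1)
            val dLon = math.toRadians(lon2 - lon1)
            val a = math.pow(math.sin(dLat / 2), 2) +
              math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
                math.pow(math.sin(dLon / 2), 2)
            2 * earthRadiusKm * math.asin(math.sqrt(a))
          }

          def main(args: Array[String]): Unit = {
            val atEquator = haversineKm(0.0, 0.0, 0.0, 1.0)             // ~111 km
            val atSf      = haversineKm(37.773972, 0.0, 37.773972, 1.0) // ~88 km
            val shrinkage = 100.0 * (1.0 - atSf / atEquator)
            println(f"1 deg longitude at the equator:    $atEquator%.1f km")
            println(f"1 deg longitude at latitude 37.77: $atSf%.1f km")
            println(f"shrinkage: $shrinkage%.1f%%")                     // a bit over 20%
          }
        }

    Euclidean k-means on raw degrees treats those two distances as equal, which is exactly the distortion described above; a haversine-aware algorithm does not.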