scala, apache-spark, cluster-analysis, k-means, apache-spark-mllib

Spark MLlib K-Means Clustering


I have some geographical points defined by latitude, longitude and a score, and I want to use the MLlib K-Means algorithm to cluster them. Is that available in MLlib K-Means, and if so, how can I pass the parameters or features to the algorithm? As far as I have found, it reads a text file of doubles and builds clusters from it.
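
For reference, here is a minimal sketch of how such features could be fed to the RDD-based MLlib KMeans. The file name, column layout, k, and iteration count are all hypothetical, and the answer below explains why clustering raw latitude/longitude this way is problematic:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"))

        // Hypothetical input: one point per line as "latitude,longitude,score"
        val features = sc.textFile("points.csv").map { line =>
          val Array(lat, lon, score) = line.split(",").map(_.trim.toDouble)
          Vectors.dense(lat, lon, score)   // each point becomes one feature vector
        }.cache()

        // Train k-means with an assumed k = 5 and 20 iterations
        val model = KMeans.train(features, 5, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }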


Solution

  • Do not use k-means on latitude/longitude data

    Because of distortion. The Earth is a sphere, and longitudes of -180° and +180° are not 360° apart; they meet at the same meridian. And even if your data is well away from the date line, e.g. all of it is in San Francisco at latitude ~37.773972, a degree of longitude covers over 20% less ground than a degree of latitude, and this gets worse the further north you go (see the sketch after this answer).

    Use an algorithm such as HAC or DBSCAN that can work with haversine distance (in a good implementation; there are many bad implementations). ELKI, for example, has very fast clustering algorithms, allows different geo-distances, and even offers index acceleration, which helps a lot with geo points.

    See also this blog post: https://doublebyteblog.wordpress.com/2014/05/16/clustering-geospatial-data/
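
    To make the distortion concrete, here is a small, self-contained sketch (only the San Francisco latitude above is taken from this answer; the rest is plain haversine math) comparing how far one degree of longitude reaches at the equator versus at that latitude:

        object GeoDistortionDemo {
          // Great-circle (haversine) distance in kilometres between two lat/lon points
          def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
            val earthRadiusKm = 6371.0
            val dLat = math.toRadians(lat2 - lat1)
            val dLon = math.toRadians(lon2 - lon1)
            val a = math.pow(math.sin(dLat / 2), 2) +
              math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
                math.pow(math.sin(dLon / 2), 2)
            2 * earthRadiusKm * math.asin(math.sqrt(a))
          }

          def main(args: Array[String]): Unit = {
            val atEquator = haversineKm(0.0, 0.0, 0.0, 1.0)             // ~111 km
            val atSf      = haversineKm(37.773972, 0.0, 37.773972, 1.0) // ~88 km
            val shrinkage = 100.0 * (1.0 - atSf / atEquator)
            println(f"1 deg longitude at the equator:    $atEquator%.1f km")
            println(f"1 deg longitude at latitude 37.77: $atSf%.1f km")
            println(f"shrinkage: $shrinkage%.1f%%")                     // a bit over 20%
          }
        }

    Euclidean k-means on raw degrees treats those two distances as equal, which is exactly the distortion described above; a haversine-aware algorithm does not.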