Search code examples
javak-meansdata-analysis

java - k-means clustering


I have the following input integer vectors (example):

4 138 233 461 610 621 669 742 814 827
89 138 334 656 697 810
138
138 196 738
659 738
4 461
138 337 756 810
8 138 196 337 468 663 664 756 809 810

They all contain integer values [1-850] and are all stored in a csv file.

I want to divide them into multiple clusters based on similarities in the vectors, but I'm confused about how exactly to implement a k-means algorithm for my input data in java. Anyone willing to help out with tips or code?

Thanks in advance.


Solution

  • Pseudo-code for k-means clustering

    assuming you have a metric (let's call this M) which can compare input objects (in your case vectors) and output a measure of similarity.

    and a function (let's call this A) which is capable of calculating the average of a collection of input objects

    1. randomly select N items from your dataset. They are the new centers of the clusters (called centroids).
    2. for each item X that isn't a centroid, calculate its distance to each centroid (using M), and mark it as belonging to that centroid C where the distance between X and C (using M) is smallest.
    3. Each item is now assigned to a centroid.
    4. Use the averaging function (A) to calculate the new centroids
    5. Either use the output of A directly as the new centroids, or find actual items that are closest to the output of A (using M)
    6. repeat steps 2 till 5 until convergence (or until your computational budget has been spent)

    Also check out https://en.wikipedia.org/wiki/K-means_clustering