c#, cluster-analysis, k-means, data-analysis

Creating clusters for data based on proximity of data points in C#


I have a collection of data points contained in List<Point4D> allPoints, where each Point4D holds its x, y, z location in space (point.X, point.Y, point.Z) and its magnitude value (point.W). The data points represent individual points of stress on an object, so the object has various clusters in which the data points are in close proximity and have similar magnitudes.

I want to be able to identify where these clusters are and which data points they include. The user needs to be able to see the clusters and will (eventually) be able to filter them based on size/number of points/stress value magnitude, etc (this is not my main concern right now).

For now, I'd just like to be able to generate a sort of "bubble" around the data points included in each cluster, so that I can display each cluster individually.

I have tried implementing K-means but got stuck as I needed to know how many clusters there were beforehand (at least, this was a requirement in all the implementations I've found). For my purposes, I will not know how many clusters there are or where they are beforehand; this information varies depending on the current data set being analyzed (the data is imported from a .csv file uploaded by the user).

Any ideas would be greatly appreciated!


Solution

  • The usual way is to run k-means several times for different k, and pick the "best" by some heuristic such as the (stupid) elbow method. Better choices include VRC (the variance ratio criterion, also known as the Calinski-Harabasz index), but it should be very clear that there is no universally best k, and your application may be an example where you will likely want a larger k than the "best" found by such methods.

    Also, there are variants such as x-means and g-means that try to "learn k" during clustering, largely by trying to split clusters as long as some heuristic improves.
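
As a rough illustration of the "run k-means for several k and score each result" approach, here is a minimal sketch in C#. It is not a tuned implementation: the class name KSelection and its methods are made up for this example, each point is assumed to be a double[] of { X, Y, Z, W }, and the farthest-first seeding is used only to keep the sketch deterministic. The score is the VRC, where higher is better.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for this sketch; not part of any library.
public static class KSelection
{
    // Squared Euclidean distance between two points of equal dimension.
    static double Dist2(double[] a, double[] b)
    {
        double sum = 0;
        for (int d = 0; d < a.Length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sum;
    }

    // Component-wise mean of a non-empty set of points.
    static double[] Mean(IEnumerable<double[]> pts)
    {
        var list = pts.ToList();
        var mean = new double[list[0].Length];
        foreach (var p in list)
            for (int d = 0; d < mean.Length; d++) mean[d] += p[d] / list.Count;
        return mean;
    }

    // Plain Lloyd's k-means; returns one cluster label per point.
    // Seeding is farthest-first (an assumption made for reproducibility);
    // k-means++ is the more common choice in practice.
    public static int[] KMeans(IReadOnlyList<double[]> pts, int k, int maxIter = 100)
    {
        var centers = new List<double[]> { pts[0] };
        while (centers.Count < k)
            centers.Add(pts.OrderByDescending(p => centers.Min(c => Dist2(p, c))).First());

        var labels = new int[pts.Count];
        for (int iter = 0; iter < maxIter; iter++)
        {
            // Assignment step: attach every point to its nearest center.
            bool changed = false;
            for (int i = 0; i < pts.Count; i++)
            {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Dist2(pts[i], centers[c]) < Dist2(pts[i], centers[best])) best = c;
                if (labels[i] != best) { labels[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;

            // Update step: move each center to the mean of its members.
            for (int c = 0; c < k; c++)
            {
                var members = pts.Where((p, i) => labels[i] == c).ToList();
                if (members.Count > 0) centers[c] = Mean(members);
            }
        }
        return labels;
    }

    // Variance ratio criterion (Calinski-Harabasz index). Requires 2 <= k < n.
    // Empty clusters are skipped; zero within-cluster variance yields infinity.
    public static double Vrc(IReadOnlyList<double[]> pts, int[] labels, int k)
    {
        var overall = Mean(pts);
        double between = 0, within = 0;
        for (int c = 0; c < k; c++)
        {
            var members = pts.Where((p, i) => labels[i] == c).ToList();
            if (members.Count == 0) continue;
            var center = Mean(members);
            between += members.Count * Dist2(center, overall);
            within += members.Sum(p => Dist2(p, center));
        }
        return (between / (k - 1)) / (within / (pts.Count - k));
    }

    // Tries k = 2..maxK and returns the k with the highest VRC.
    public static int BestK(IReadOnlyList<double[]> pts, int maxK)
    {
        return Enumerable.Range(2, maxK - 1)
            .OrderByDescending(k => Vrc(pts, KMeans(pts, k), k))
            .First();
    }
}
```

With the question's data, this could be called as KSelection.BestK(data, 10), where data is built by converting each Point4D to new[] { p.X, p.Y, p.Z, p.W } (assuming the coordinates are doubles). Because W is a stress magnitude rather than a position, it may be worth rescaling the four components before clustering so that no single one dominates the distance. As noted above, the k returned here is only the best by this heuristic, and a larger k may suit the application better.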