stream cluster-analysis k-means moa

MOA CluStream: What should we "name" the micro clusters that do not lie inside any of the macro clusters after k-means is computed?


I am currently studying CluStream, and I have some doubts regarding the results. I will proceed to explain:

If the micro clusters are clustered using k-means, every micro cluster belongs to its closest macro cluster (measured by the Euclidean distance between the centers).

Now, looking at the following sample result:

[image: sample clustering result]

we can see that the macro clusters do not group all the micro clusters …

What does this mean? How should we treat the micro clusters that do not lie inside any macro cluster? Should I find the closest macro cluster for every micro cluster in order to label it?

EDIT:

Checking the MOA source code on GitHub, I found that the macro cluster radius is calculated by multiplying the average deviation by the so-called ‘radius factor’ (whose value is fixed at 1.8). However, when I ask the macro clusters for their weights, using a huge time window and no fading component, I can see that the macro clusters summarize the information of all the points ... all the current micro clusters are considered! So, even if we see some micro clusters that lie outside the macro cluster spheres, we know that they belong to the closest one - it's k-means after all!
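To make the radius computation concrete, here is a minimal sketch in plain Python (not MOA's Java code). It assumes the "average deviation" is the per-dimension standard deviation averaged over all dimensions, derived from the usual cluster feature statistics (count, linear sum, squared sum). The function name `macro_radius` and the sample points are made up for illustration; the points stand in for micro cluster centers.

```python
import math

RADIUS_FACTOR = 1.8  # the fixed value quoted above


def macro_radius(n, linear_sum, squared_sum, radius_factor=RADIUS_FACTOR):
    """Radius = radius_factor * mean of the per-dimension standard deviations.

    Assumption: this mirrors the 'average deviation' described above;
    it is a sketch, not a copy of MOA's implementation.
    """
    deviations = []
    for ls, ss in zip(linear_sum, squared_sum):
        mean = ls / n
        variance = max(ss / n - mean * mean, 0.0)  # clamp rounding noise
        deviations.append(math.sqrt(variance))
    return radius_factor * sum(deviations) / len(deviations)


# Hypothetical members of one macro cluster, spread more along x than y.
points = [(3.0, 0.0), (-3.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
n = len(points)
linear_sum = [sum(p[d] for p in points) for d in range(2)]
squared_sum = [sum(p[d] ** 2 for p in points) for d in range(2)]

radius = macro_radius(n, linear_sum, squared_sum)
center = [ls / n for ls in linear_sum]

# Members farther from the center than the drawn radius.
outside = [p for p in points if math.dist(p, center) > radius]
print(radius, outside)
```

Under these assumptions the radius is a spread statistic, not a bounding radius: the two members at distance 3 from the center fall outside a sphere of radius of about 2.55, even though all four points contributed to the cluster.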

So, I still have a question: why is the macro cluster radius calculated that way? I mean, what does it represent? Shouldn't the algorithm return the labeled micro clusters instead?

Any feedback is welcome. TIA!


Solution

  • The key question is: what does the user need?

    Labeling micro-clusters is fine, but what use is that to the user?

    In most cases, the only part of the k-means result that people use is the cluster centers, because the entire objective of k-means is essentially "find the best k-point approximation to the data".

    So, most likely, all that users of CluStream are going to use are the k current cluster centers, and maybe each center's weight and age.
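If labels for the micro clusters are still wanted, they can be recovered from the centers alone. The following is a minimal sketch in plain Python (not the MOA API): each micro cluster gets the index of its nearest macro center by Euclidean distance, and the micro cluster weights are summed per macro cluster. The function name `label_micro_clusters` and all sample values are hypothetical.

```python
import math


def label_micro_clusters(micro_centers, macro_centers):
    """Return, for each micro cluster, the index of the nearest macro center."""
    return [
        min(range(len(macro_centers)),
            key=lambda k: math.dist(m, macro_centers[k]))
        for m in micro_centers
    ]


# Hypothetical micro clusters (center and weight) and macro centers.
micro_centers = [(0.0, 0.0), (1.0, 0.5), (9.0, 9.0), (4.0, 4.0)]
micro_weights = [10.0, 5.0, 8.0, 2.0]
macro_centers = [(0.5, 0.5), (9.0, 9.0)]

labels = label_micro_clusters(micro_centers, macro_centers)

# Aggregate the micro cluster weights per macro cluster.
macro_weights = [0.0] * len(macro_centers)
for label, weight in zip(labels, micro_weights):
    macro_weights[label] += weight

print(labels)         # [0, 0, 1, 0]
print(macro_weights)  # [17.0, 8.0]
```

The micro cluster at (4, 4) is far from both centers, yet it still receives a label and contributes its weight, which is the nearest-center behavior described in the question.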