Problem: It appears to me that a fundamental property of a clustering method c()
is whether we can combine the results c(A)
and c(B)
by some function f()
of two clusterings in a way that we do not have to apply the full clustering c(A+B)
again but instead do f(c(A),c(B))
and still end up with the same result:
c(A+B) == f(c(A),c(B))
I suppose that a necessary condition for some c()
to have this property is that it is determistic, that is the order of its internal processing is irrelevant for the result. However, this might not be sufficient.
It would be really nice to have some reference where to look up which cluster methods support this and what a good f()
looks like in the respective case.
Example: At the moment I am thinking about DBSCAN which should be deterministic if I allow border points to belong to multiple clusters at the same time (without connecting them):
If you miss the noise points then assume that each core node reaches itself (reflexivity) and afterwards we define noise points to be clusters of size one. Border points are non-core points. Afterwards if we want a partitioning, we can assign randomly the border points that are in multiple clusters to one of them. I do not consider this relevant for the method itself.
Supposedly the only clustering where this is efficiently possible is single linkage hierarchical clustering, because edges removed from A x A and B x B are not necessary for finding the MST of the joined set.
For DBSCAN precisely, you have the problem that the core point property can change when you add data. So c(A+B) likely has core points that were not core in either A not B. This can cause clusters to merge. f() supposedly needs to re-check all data points, i.e., rerun DBSCAN. While you can exploit that core points of the subset must be core of the entire set, you'll still need to find neighbors and missing core points.