This sounds like a trivial problem, but I couldn't find anything online.
We have a set of elements a b c d e
. For those elements pair-wise distances are defined. Each element needs to be processed. In order to process an element - N of it's nearest neighbours are needed.
Problem: how to break those elements in to M
sets of roughly equal size, and then extend those sets, so that each element inside the set would have N
nearest neighbours in the extended set. This can be used for parallelising processing of original elements.
I'm using Spark - but this probably can be abstracted to any parallel computation.
Here's an example:
We have following elements, the distance between them is just their difference.
N = 4 # number of nearest neighbours required for the computation
M = 2 # desired number of clusters
elements:
1 2 3 4 5 6
basic clusters:
1 2 3
4 5 6
extended clusters:
1 2 3 (4 5)
4 5 6 (2 3)
How is this called, is there some general approach to this kind of a problem? My understanding is that this isn't strictly clustering
.
This algorithm (clustering + extending) will be running on a single node, then bulk of the data will be joined and process in a distributed system.
In a first step, a simple greedy algorithm could be tested.
I have the feeling that it is more logical to determine the overlapping (extended) sets, and then to determine the non extended ones.
Let us select K (= M ?) points as distant as possible from all the others.
I assume here that selecting such extremal points is feasible, 1
and 6
in your example.
Note that the initial number of points could be lower than M.
A variant is to continue the process until each set has the required number of satisfied points.
Each random process may provide a different solution. Several attempts can be performed in parallel on different nodes.
In your simple example, the process provides the solution immediately :
It may happen that two different sets have a same satisfied point. Even if this point must stay in each exented set, it could be removed from one of non-extended set
One justification of this algorithm : I take for granted that extreme points must be in different non-extended sets. This implies that their neighbors must be present in the corresponded extended set Si.