algorithm optimization parallel-processing cluster-analysis computer-science

Clustering algorithm and "extending" clusters to include N nearest neighbours

This sounds like a trivial problem, but I couldn't find anything online.

We have a set of elements a b c d e. For those elements pair-wise distances are defined. Each element needs to be processed. In order to process an element - N of it's nearest neighbours are needed.

Problem: how to break those elements in to M sets of roughly equal size, and then extend those sets, so that each element inside the set would have N nearest neighbours in the extended set. This can be used for parallelising processing of original elements.

I'm using Spark - but this probably can be abstracted to any parallel computation.

Here's an example:

We have following elements, the distance between them is just their difference.

N = 4 # number of nearest neighbours required for the computation
M = 2 # desired number of clusters

elements:
  1 2 3 4 5 6

basic clusters:
  1 2 3
  4 5 6

extended clusters:
  1 2 3 (4 5)
  4 5 6 (2 3)

How is this called, is there some general approach to this kind of a problem? My understanding is that this isn't strictly clustering.

This algorithm (clustering + extending) will be running on a single node, then bulk of the data will be joined and process in a distributed system.

Solution

In a first step, a simple greedy algorithm could be tested.

I have the feeling that it is more logical to determine the overlapping (extended) sets, and then to determine the non extended ones.

Let us select K (= M ?) points as distant as possible from all the others.
I assume here that selecting such extremal points is feasible, 1 and 6 in your example.
Note that the initial number of points could be lower than M.

These initial points Pi determine K sets Si.
Then, each Si is completed by the needed neighbors of Pi.
For each set, we can determine the number of points that have enough neighbors.
If K < M, we can determine M-K point as far as possible of the previous sets and set up new sets with these points and their neighbors.
if all points are in a set with all their neighbors: STOP.
Select a set with the lowest number of satisfied points, i.e. with all their neighbors. In this set, determine the points with the lowest number of missing neighbors. Select randomly such a point, and complete the set with the missing neighbors of this point
go to step 3 until the stopping criteria is satisfied.

A variant is to continue the process until each set has the required number of satisfied points.

Each random process may provide a different solution. Several attempts can be performed in parallel on different nodes.

In your simple example, the process provides the solution immediately :

S0 = {1} -> S0 = {1, 2, 3, 4, 5}
S1 = {6} -> S1 = {6, 5, 4, 3, 2}

It may happen that two different sets have a same satisfied point. Even if this point must stay in each exented set, it could be removed from one of non-extended set

One justification of this algorithm : I take for granted that extreme points must be in different non-extended sets. This implies that their neighbors must be present in the corresponded extended set Si.