Search code examples
algorithmoptimizationparallel-processingcluster-analysiscomputer-science

Clustering algorithm and "extending" clusters to include N nearest neighbours


This sounds like a trivial problem, but I couldn't find anything online.

We have a set of elements a b c d e. For those elements pair-wise distances are defined. Each element needs to be processed. In order to process an element - N of it's nearest neighbours are needed.

Problem: how to break those elements in to M sets of roughly equal size, and then extend those sets, so that each element inside the set would have N nearest neighbours in the extended set. This can be used for parallelising processing of original elements.

I'm using Spark - but this probably can be abstracted to any parallel computation.

Here's an example:

We have following elements, the distance between them is just their difference.

N = 4 # number of nearest neighbours required for the computation
M = 2 # desired number of clusters

elements:
  1 2 3 4 5 6

basic clusters:
  1 2 3
  4 5 6

extended clusters:
  1 2 3 (4 5)
  4 5 6 (2 3)

How is this called, is there some general approach to this kind of a problem? My understanding is that this isn't strictly clustering.

This algorithm (clustering + extending) will be running on a single node, then bulk of the data will be joined and process in a distributed system.


Solution

  • In a first step, a simple greedy algorithm could be tested.

    I have the feeling that it is more logical to determine the overlapping (extended) sets, and then to determine the non extended ones.

    Let us select K (= M ?) points as distant as possible from all the others.
    I assume here that selecting such extremal points is feasible, 1 and 6 in your example.
    Note that the initial number of points could be lower than M.

    1. These initial points Pi determine K sets Si.
    2. Then, each Si is completed by the needed neighbors of Pi.
    3. For each set, we can determine the number of points that have enough neighbors.
    4. If K < M, we can determine M-K point as far as possible of the previous sets and set up new sets with these points and their neighbors.
    5. if all points are in a set with all their neighbors: STOP.
    6. Select a set with the lowest number of satisfied points, i.e. with all their neighbors. In this set, determine the points with the lowest number of missing neighbors. Select randomly such a point, and complete the set with the missing neighbors of this point
    7. go to step 3 until the stopping criteria is satisfied.

    A variant is to continue the process until each set has the required number of satisfied points.

    Each random process may provide a different solution. Several attempts can be performed in parallel on different nodes.

    In your simple example, the process provides the solution immediately :

    • S0 = {1} -> S0 = {1, 2, 3, 4, 5}
    • S1 = {6} -> S1 = {6, 5, 4, 3, 2}

    It may happen that two different sets have a same satisfied point. Even if this point must stay in each exented set, it could be removed from one of non-extended set

    One justification of this algorithm : I take for granted that extreme points must be in different non-extended sets. This implies that their neighbors must be present in the corresponded extended set Si.