Search code examples
pythongroupingcluster-analysisconnected-components

Clustering by repeated data


Is there a way to split a data set consisting of pairs of 3D points (or just their index numbers) into connected clusters? That is, two pairs (a,b) and (c,d) should be in the same cluster if they share a common point (i.e. a = c, b = c, a = d or b = d) or if there is a chain of one or more other pairs, each sharing a common point with the previous one, from one pair to the other.

For example, the list of pairs:

[[1,2],[2,3],[4,5],[6,7],[7,8],[9,4],[8,5]]

would be grouped as follows:

[[1,2],[2,3]]

[[4,5],[6,7],[7,8],[9,4],[8,5]]

In the first cluster, the pairs (1,2) and (2,3) have the point 2 in common, and share no points with any pairs outside the cluster. In the second cluster, the pair (4,5) shares common points with (9,4) and (8,5), while (8,5) has a common point with (7,8), which has a common point with (6,7).

The data is originally stored in a numpy array, but the output format is not too important.

I need to be able to access the data that makes up each individual cluster afterwards.


Solution

  • Using networkx:

    import networkx
    
    edges = [[1, 2], [2, 3], [4, 5], [6, 7], [7, 8], [9, 4], [8, 5]]
    
    graph = networkx.Graph(edges)
    print(list(networkx.connected_components(graph)))
    

    Output:

    [set([1, 2, 3]), set([4, 5, 6, 7, 8, 9])]