Search code examples
pythonalgorithmhierarchical-clusteringlinkage

How can I list out all current clusters when using a single linkage algorithm?


I'm now doing clustering on python using from scipy.cluster.hierarchy import linkage From the manual I know that it gives result in this form --> [A, B, length, #] which A and B is the indices of elements that are going to merge in this...stage(?), but can I get any information about clusters which is already merged but not going to participate in this stage?

For example, my dataset is:

A=[[1,1],[1,2],[1,3],[1,4],[1,5],
   [10,1],[10,2],[10,3],[10,4],[10,5],
   [15,1],[15,2],[15,3],[15,4],[15,5],
   [30,1],[30,2],[30,3],[30,4],[30,5]]

and apply single linkage algorithm on it

Z = linkage(A, 'single')

Z=[[  0.   4.   1.   2.]
   [  1.  20.   1.   3.]
   [  2.  21.   1.   4.]
   [  3.  22.   1.   5.]
   [ 17.  19.   1.   2.]
   [  5.   9.   1.   2.]
   [  6.  25.   1.   3.]
   [  7.  26.   1.   4.]
   [  8.  27.   1.   5.]
   [ 18.  24.   1.   3.]
   [ 10.  14.   1.   2.]
   [ 11.  30.   1.   3.]
   [ 12.  31.   1.   4.]
   [ 13.  32.   1.   5.]
   [ 16.  29.   1.   4.]
   [ 15.  34.   1.   5.]
   [ 28.  33.   5.  10.]
   [ 23.  36.   9.  15.]
   [ 35.  37.  15.  20.]]

here I choose 5 to be the distance limitation in clustering, so I get

[ 28. 33. 5. 10.]

then I traced 28 and 33 back to the original indices

cut = 5
temp1 = []
temp2 = []
for i in range(len(Z)):
if Z[i][2] >= cut:
    temp1.append(Z[i])
for i in range(2):
    temp2[i].append(int(temp1[0][i]))
for j in range(0, len(temp2)):
try:
    g = max(temp2[j])
except:
    continue
G = int(g - len(A))
while g >= len(A):
    ind = temp2[j].index(g)
    temp2[j].append(int(Z[G][0]))
    temp2[j].append(int(Z[G][1]))
    del temp2[j][ind]
    g = max(temp2[j])
    G = int(g - len(A))

and found that

temp2 = [[8, 7, 6, 5, 9], [13, 12, 11, 10, 14]]

which means '28' stands for points [10,1],[10,2],[10,3],[10,4],[10,5] and '33' stands for points [15,1],[15,2],[15,3],[15,4],[15,5], which clearly means that the cluster consist of [10,x] and the cluster consist of [15,x] are going to merge in this stage.

But obviously [1,1],[1,2],[1,3],[1,4],[1,5] and [30,1],[30,2],[30,3],[30,4],[30,5] must have formed another two clusters in earlier stage, so at the moment before [10,x] and [15,x] merge , currently there are 4 clusters

So the result I want is like:

temp2 = [[8, 7, 6, 5, 9], [13, 12, 11, 10, 14], [0, 1, 2, 3, 4], [15, 16, 17, 18, 19]]

What should I do to get the later two clusters T^T??


Solution

  • As described in the documentation, linkage gives you the distance between clusters, which is the same as the cophenetic distance between elements in those clusters. As described in other documentation, fcluster will give you flat clusters, and if you specify 'distance' as the criterion, will cut the dendrogram according to the cophenetic distance.

    So you can get what you want by using fcluster to threshold the clusters at your chosen distance. However, a slight wrinkle is that fcluster treats the threshold as the highest distance to lump, not the lowest distance to split, so if you use 5 as the threshold it will join the two clusters you're referring to and give you only three clusters. You'll have to choose a threshold slightly less than 5 to get what you want. For instance:

    from scipy.cluster import hierarchy as clust
    >>> clust.fcluster(Z, 4.99999, criterion='distance')
    array([2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1])
    

    This is telling you which cluster each item is in. To translate that back into lists of the indices in each cluster, you could use np.where:

    >>> clusters = clust.fcluster(Z, 4.99999, criterion='distance')
    >>> [np.where(clusters==k)[0].tolist() for k in np.unique(clusters)]
    [[15L, 16L, 17L, 18L, 19L],
     [0L, 1L, 2L, 3L, 4L],
     [5L, 6L, 7L, 8L, 9L],
     [10L, 11L, 12L, 13L, 14L]]
    

    In sum, the idea is to look at what you call the "distance limitation" and use fclust to get flat clusters with that distance (or rather, a slightly smaller distance) as the threshold. This will give you the cluster number of each index, and then you can use np.where to get a list for each cluster.