I'm now doing clustering on python using from scipy.cluster.hierarchy import linkage
From the manual I know that it gives result in this form --> [A, B, length, #]
which A and B is the indices of elements that are going to merge in this...stage(?), but can I get any information about clusters which is already merged but not going to participate in this stage?
For example, my dataset is:
A=[[1,1],[1,2],[1,3],[1,4],[1,5],
[10,1],[10,2],[10,3],[10,4],[10,5],
[15,1],[15,2],[15,3],[15,4],[15,5],
[30,1],[30,2],[30,3],[30,4],[30,5]]
and apply single linkage algorithm on it
Z = linkage(A, 'single')
Z=[[ 0. 4. 1. 2.]
[ 1. 20. 1. 3.]
[ 2. 21. 1. 4.]
[ 3. 22. 1. 5.]
[ 17. 19. 1. 2.]
[ 5. 9. 1. 2.]
[ 6. 25. 1. 3.]
[ 7. 26. 1. 4.]
[ 8. 27. 1. 5.]
[ 18. 24. 1. 3.]
[ 10. 14. 1. 2.]
[ 11. 30. 1. 3.]
[ 12. 31. 1. 4.]
[ 13. 32. 1. 5.]
[ 16. 29. 1. 4.]
[ 15. 34. 1. 5.]
[ 28. 33. 5. 10.]
[ 23. 36. 9. 15.]
[ 35. 37. 15. 20.]]
here I choose 5 to be the distance limitation in clustering, so I get
[ 28. 33. 5. 10.]
then I traced 28 and 33 back to the original indices
cut = 5
temp1 = []
temp2 = []
for i in range(len(Z)):
if Z[i][2] >= cut:
temp1.append(Z[i])
for i in range(2):
temp2[i].append(int(temp1[0][i]))
for j in range(0, len(temp2)):
try:
g = max(temp2[j])
except:
continue
G = int(g - len(A))
while g >= len(A):
ind = temp2[j].index(g)
temp2[j].append(int(Z[G][0]))
temp2[j].append(int(Z[G][1]))
del temp2[j][ind]
g = max(temp2[j])
G = int(g - len(A))
and found that
temp2 = [[8, 7, 6, 5, 9], [13, 12, 11, 10, 14]]
which means '28' stands for points [10,1],[10,2],[10,3],[10,4],[10,5]
and '33' stands for points [15,1],[15,2],[15,3],[15,4],[15,5]
, which clearly means that the cluster consist of [10,x] and the cluster consist of [15,x] are going to merge in this stage.
But obviously [1,1],[1,2],[1,3],[1,4],[1,5]
and [30,1],[30,2],[30,3],[30,4],[30,5]
must have formed another two clusters in earlier stage, so at the moment before [10,x] and [15,x] merge , currently there are 4 clusters
So the result I want is like:
temp2 = [[8, 7, 6, 5, 9], [13, 12, 11, 10, 14], [0, 1, 2, 3, 4], [15, 16, 17, 18, 19]]
What should I do to get the later two clusters T^T??
As described in the documentation, linkage
gives you the distance between clusters, which is the same as the cophenetic distance between elements in those clusters. As described in other documentation, fcluster
will give you flat clusters, and if you specify 'distance'
as the criterion, will cut the dendrogram according to the cophenetic distance.
So you can get what you want by using fcluster
to threshold the clusters at your chosen distance. However, a slight wrinkle is that fcluster
treats the threshold as the highest distance to lump, not the lowest distance to split, so if you use 5 as the threshold it will join the two clusters you're referring to and give you only three clusters. You'll have to choose a threshold slightly less than 5 to get what you want. For instance:
from scipy.cluster import hierarchy as clust
>>> clust.fcluster(Z, 4.99999, criterion='distance')
array([2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1])
This is telling you which cluster each item is in. To translate that back into lists of the indices in each cluster, you could use np.where
:
>>> clusters = clust.fcluster(Z, 4.99999, criterion='distance')
>>> [np.where(clusters==k)[0].tolist() for k in np.unique(clusters)]
[[15L, 16L, 17L, 18L, 19L],
[0L, 1L, 2L, 3L, 4L],
[5L, 6L, 7L, 8L, 9L],
[10L, 11L, 12L, 13L, 14L]]
In sum, the idea is to look at what you call the "distance limitation" and use fclust
to get flat clusters with that distance (or rather, a slightly smaller distance) as the threshold. This will give you the cluster number of each index, and then you can use np.where
to get a list for each cluster.