
SciPy's cut_tree() doesn't return the requested number of clusters, and the linkage matrices obtained with SciPy and fastcluster do not match


I'm running an agglomerative hierarchical clustering (AHC) experiment using the fastcluster package together with functions from the scipy.cluster.hierarchy module, in Python 3, and I've found puzzling behaviour in the cut_tree() function.

I cluster the data with no problem and get a linkage matrix, Z, using linkage_vector() with method='ward'. Then I want to cut the dendrogram tree to get a fixed number of clusters (e.g. 33), and I do this with cut_tree(Z, n_clusters=33). (Remember that AHC is a deterministic method yielding a binary tree connecting all your datapoints, which sit at the leaves of the tree; you can look at this tree at any level to "see" the number of clusters you want; all cut_tree() does is return a set of n_clusters integer labels, from 0 to n_clusters - 1, one for every point of the dataset.)
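For concreteness, here is a minimal toy example (synthetic points, not my dataset) of what cut_tree() normally returns:

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Two well-separated groups of 2D points (toy data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

Z = hac.linkage(X, method='ward')        # stepwise dendrogram, shape (n-1, 4)
labels = hac.cut_tree(Z, n_clusters=2)   # shape (n, 1), labels in {0, 1}

print(labels.ravel())                    # e.g. [0 0 1 1]
```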

I've done this many times in other experiments and I always get the number of clusters I request. The problem is that with this one dataset, when I ask cut_tree() for 33 clusters, it gives me only 32. I don't see why this is the case. Could it be a bug? Are you aware of any bug with cut_tree()? I tried to debug this behaviour and performed the same clustering experiment using scipy's linkage() function. With the resulting linkage matrix as input to cut_tree() I didn't get an unexpected number of clusters as output. I also verified that the linkage matrices output by the two methods are not equal.

The [dataset] I'm using consists of 10680 vectors, each with 20 dimensions. Check the following experiment:

import numpy as np
import fastcluster as fc
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

### *Load dataset (10680 vectors, each with 20 dimensions)*
X = np.load('dataset.npy')

### *Hierarchical clustering using traditional scipy method*
dists = pdist(X)
Z_1 = hac.linkage(dists, method='ward')

### *Hierarchical clustering using optimized fastcluster method*
Z_2 = fc.linkage_vector(X, method='ward')

### *Comparisons*

## Are the linkage matrices equal?
print("Z_1 == Z_2 ? ", np.allclose(Z_1, Z_2))

## Is scipy's cut_tree() returning the requested number of clusters when using Z_2?
print("Req.\tGot\tequal?")
for i in range(1,50):
    cut = hac.cut_tree(Z_2, i)
    uniq = len(np.unique(cut))
    print(i,"\t",uniq,"\t",i==uniq)

## The same as before, but in condensed form. When cut_tree() is asked for cluster
#  counts in the range 1-49, does it return a wrong result at some point?
print("Any problem cutting Z_1 for n_clusters in [1,50]? ", not np.all([len(np.unique(
                                      hac.cut_tree(Z_1, i)))==i for i in range(1,50)]))
print("Any problem cutting Z_2 for n_clusters in [1,50]? ", not np.all([len(np.unique(
                                      hac.cut_tree(Z_2, i)))==i for i in range(1,50)]))

#Output:
#
#Z_1 == Z_2 ?  False
#
#Req.    Got     equal?
#1        1       True
#2        2       True
#3        3       True
#4        4       True
#5        5       True
#6        6       True
#7        7       True
#8        8       True
#9        9       True
#10       10      True
#11       11      True
#12       12      True
#13       13      True
#14       14      True
#15       15      True
#16       16      True
#17       17      True
#18       18      True
#19       19      True
#20       20      True
#21       21      True
#22       22      True
#23       23      True
#24       24      True
#25       25      True
#26       26      True
#27       27      True
#28       28      True
#29       29      True
#30       30      True
#31       31      True
#32       32      True
#33       32      False
#34       33      False
#35       34      False
#36       35      False
#37       36      False
#38       37      False
#39       38      False
#40       39      False
#41       40      False
#42       41      False
#43       42      False
#44       43      False
#45       44      False
#46       45      False
#47       46      False
#48       47      False
#49       48      False
#
#Any problem cutting Z_1 for n_clusters in [1,50]?  False
#Any problem cutting Z_2 for n_clusters in [1,50]?  True

You might have noticed that the dataset contains duplicated vectors: 37 rows are exact copies of earlier rows, and in total 55 rows have at least one identical copy somewhere in the dataset.

For inspection, I decided to plot the dendrogram tree down to a shallow depth level for the two linkage matrices, which you can see in the image below (Z_1 at the top and Z_2 at the bottom). Numbers inside parentheses indicate the population contained below that branch; numbers without parentheses are leaves of the tree (the number is the index of the vector in the X matrix). One can see that the only difference (at the plotted level) is at the branches marked with the red square, which coalesce at distance 0 because they contain overlapping vectors.

[Image: dendrograms for Z_1 (top) and Z_2 (bottom)]

So, I ran the clustering procedures shown in the previous code again, but this time on only the subset of the data containing the 55 vectors that have at least one copy. I obtained X_subset with:

uniqs, uniqs_indices, uniqs_count = np.unique(X, axis=0, return_index=True, return_counts=True)
duplicate_rows_indices = list( set(range(len(X))) - set(uniqs_indices) )
number_of_duplicate_rows = len(X)-len(uniqs) # 37

all_duplicate_rows = set()
for i in duplicate_rows_indices:
    # compare whole rows (all 20 dimensions), not individual elements
    _rows = set(np.where((X == X[i]).all(axis=1))[0])
    all_duplicate_rows.update(_rows)

rows_with_at_least_a_copy = list(all_duplicate_rows)
number_of_rows_with_at_least_a_copy = len(rows_with_at_least_a_copy)  # 55

X_subset = X[rows_with_at_least_a_copy]

and my output this time was:

#Z_1 == Z_2 ?  False
#Req.    Got     equal?
#1        1       True
#2        2       True
#3        2       False
#4        3       False
#5        4       False
#6        5       False
#7        6       False
#8        7       False
#9        8       False
#10       9       False
#11       10      False
#12       11      False
#13       12      False
#14       13      False
#15       14      False
#16       15      False
#17       16      False
#18       17      False
#19       18      False
#20       20      True
#21       21      True
#22       22      True
#23       23      True
#24       24      True
#25       25      True
#26       26      True
#27       27      True
#28       28      True
#29       29      True
#30       30      True
#31       31      True
#32       32      True
#33       33      True
#34       34      True
#35       35      True
#36       36      True
#37       37      True
#38       38      True
#39       39      True
#40       40      True
#41       41      True
#42       42      True
#43       43      True
#44       44      True
#45       45      True
#46       46      True
#47       47      True
#48       48      True
#49       49      True
#Any problem cutting Z_1 for n_clusters in [1,50]?  False
#Any problem cutting Z_2 for n_clusters in [1,50]?  True
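As an aside, the duplicate-row subset above can also be obtained with a single np.unique() call; here is a sketch on toy data, the idea being identical for the real 10680x20 dataset:

```python
import numpy as np

# Toy stand-in for X: rows 0,1 are copies, rows 3,4,5 are copies
X = np.array([[1, 2], [1, 2], [3, 4], [5, 6], [5, 6], [5, 6]])

# inverse[i] is the index of row i's unique representative;
# counts[inverse] > 1 flags every row whose value occurs more than once
_, inverse, counts = np.unique(X, axis=0, return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # guard against NumPy versions where inverse has an extra axis
rows_with_at_least_a_copy = np.where(counts[inverse] > 1)[0]

X_subset = X[rows_with_at_least_a_copy]
print(rows_with_at_least_a_copy)  # [0 1 3 4 5]
```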

Thus, fastcluster and scipy are not returning the same results. If that is due only to the overlapping points, it could be acceptable, given the ambiguity of that clustering situation. But the problem is that cut_tree() sometimes doesn't return the requested number of clusters in these cases when given the linkage matrix obtained by linkage_vector(). How can this be fixed?

Library versions used: scipy 0.19.1, numpy 1.13.3, fastcluster 1.1.24

Edit: It's also posted here: https://github.com/scipy/scipy/issues/7977.


Solution

  • First, I am not concerned about the differences between the outputs of the three methods scipy.cluster.hierarchy.linkage(), fastcluster.linkage(), and fastcluster.linkage_vector(). Differences may arise for two reasons:

    • Small differences in cluster distances due to the limitations of floating-point arithmetic (algebraically equivalent formulae yield different results).

    • Aside from arithmetic differences, the algorithms are allowed to resolve ties differently. A "tie" in this context is two pairs of clusters (A, B) and (C, D) with exactly the same distance between A and B as between C and D. As the OP noted, there are e.g. numerous pairs of points at distance 0 at the beginning of the process, which the algorithms may cluster in any order.

    Second, the question of why scipy.cluster.hierarchy.cut_tree() does not yield the desired number of clusters is the really interesting issue here. Algebraically, the "Ward" clustering method cannot produce so-called "inversions" in the dendrogram. (An inversion occurs when the clustering distance in one step is bigger than the clustering distance in the next step.) That is, the sequence of distances in the third column of the stepwise dendrogram (the output of linkage()) should be a monotonically nondecreasing sequence for the "Ward" method. However, due to floating-point inaccuracies, in the dataset supplied by the OP, the clustering distance in step 35 is reported as 2.2e−16 by fastcluster.linkage_vector(), but as 0.0 in the next step, 36.
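    You can check a linkage matrix for inversions directly; scipy.cluster.hierarchy.is_monotonic() performs exactly this test on the third column. A small sketch on synthetic data (not the OP's dataset):

```python
import numpy as np
import scipy.cluster.hierarchy as hac

rng = np.random.RandomState(0)
X = rng.rand(100, 20)

Z = hac.linkage(X, method='ward')

# Ward distances should be nondecreasing; an inversion is a negative step
steps = np.diff(Z[:, 2])
print("monotonic:", hac.is_monotonic(Z))    # same check as (steps >= 0).all()
print("inversions at merges:", np.where(steps < 0)[0] + 1)
```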

    SciPy's cut_tree() unfortunately does not handle these inversions well. Even when they happen "deep down" in the dendrogram (as in the present case), this confuses the algorithm for the entire rest of the merging process, with the effect of wrongly formed clusters. (At first glance, I believe the dendrogram is not only "cut" at the wrong level but that the clusters are actually wrong. I have not analyzed the situation completely, though.)

    This is even more troubling because inversions occur not only through numerical inaccuracies, as in the present case: the "centroid" and "median" clustering methods produce them quite often, even with perfect arithmetic.
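    A three-point example makes this concrete: with the "centroid" method, an almost-equilateral triangle already produces an inversion, even though no floating-point problems are involved (synthetic points):

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Almost-equilateral triangle: the first merge happens at distance 1.0,
# but the new centroid (0.5, 0) lies only 0.9 away from the third point
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9]])

Z = hac.linkage(X, method='centroid')
print(Z[:, 2])                # distances decrease from step 1 to step 2: an inversion
print(hac.is_monotonic(Z))    # False
```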

    Lastly, an imperfect solution: to resolve the problem, replace the two marked lines in the loop of SciPy's cut_tree() function:

    for i, node in enumerate(nodes):
        idx = node.pre_order()
        this_group = last_group.copy()
        # ------------- Replace this:
        this_group[idx] = last_group[idx].min()
        this_group[this_group > last_group[idx].max()] -= 1
        # -------------
        if i + 1 in cols_idx:
            groups[np.where(i + 1 == cols_idx)[0]] = this_group
        last_group = this_group
    

    with the following lines:

        unique_idx = np.unique(last_group[idx])
        this_group[idx] = unique_idx[0]
        for ui in unique_idx[:0:-1]:
            this_group[this_group > ui] -= 1
    

    (Look at the SciPy source code for context.)

    However, I would rather recommend reimplementing the cut_tree() method from scratch, as the current implementation is overly complicated and the task can be performed with better runtime complexity.
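    As a practical workaround that avoids patching SciPy, fcluster() with criterion='maxclust' can often serve in place of cut_tree(), since it uses a different labelling procedure. (Note that it returns labels starting at 1, and in the presence of ties it may form fewer than the requested number of clusters.) A sketch on synthetic data:

```python
import numpy as np
import scipy.cluster.hierarchy as hac

rng = np.random.RandomState(0)
X = rng.rand(200, 20)

Z = hac.linkage(X, method='ward')

# Labels in 1..k such that at most k flat clusters are formed
labels = hac.fcluster(Z, t=33, criterion='maxclust')
print(len(np.unique(labels)))   # 33 here (distinct random points, no ties)
```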