
I don't understand the detailed behavior of the threshold in fcluster (method='complete')


Xi=[[0,5,10,8,3],[5,0,1,3,2],[10,1,0,5,1],[8,3,5,0,6],[3,2,1,6,0]]

Xi is a distance matrix.

shc.fcluster(shc.linkage(Xi,'complete'),9,criterion='distance')

In this code the threshold is 9.

After clustering, the result is array([3, 1, 1, 2, 1], dtype=int32).

I don't understand why it is not array([2, 1, 1, 1, 1]).

This image shows the result after clustering: https://drive.google.com/file/d/17806FuPuNpJiqhT12jiuFOMGNUvB1vjT/view?usp=sharing


Solution

  • import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
    from scipy.spatial.distance import pdist
    import matplotlib.pyplot as plt
    import seaborn as sns
    

    You have this distance matrix:

    Xi = np.array([[0,5,10,8,3],[5,0,1,3,2],[10,1,0,5,1],[8,3,5,0,6],[3,2,1,6,0]])
    

    We can visualize it as a heatmap:

    df = pd.DataFrame(Xi)
    # hide the upper triangle (diagonal included); the matrix is symmetric
    mask = np.zeros_like(Xi, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    sns.heatmap(df, annot=True, fmt='.0f', cmap="YlGnBu", mask=mask);
    


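    Now we compute the condensed pairwise distances with pdist. Note that pdist treats each row of Xi as an observation vector and returns the Euclidean distances between rows. Passing the square matrix straight to linkage, as in the question, does the same thing, so the distances being clustered are not the entries of Xi themselves.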

    p = pdist(Xi)
    

    and the linkage

    Z = linkage(p, method='complete')
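
As a sketch of what linkage does with a square input: when it receives a 2-D array it treats the rows as observations and computes the pairwise Euclidean distances itself, so the question's call linkage(Xi, 'complete') should build the same tree as running pdist first. The names Z_direct and Z_pdist are only illustrative, and SciPy may emit a ClusterWarning for the direct call, which is silenced here.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

Xi = np.array([[0, 5, 10, 8, 3],
               [5, 0, 1, 3, 2],
               [10, 1, 0, 5, 1],
               [8, 3, 5, 0, 6],
               [3, 2, 1, 6, 0]])

# Explicit route: Euclidean distances between the ROWS of Xi, then linkage.
Z_pdist = linkage(pdist(Xi), method='complete')

# Direct route, as in the question: the 2-D array is treated as observations.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # SciPy may warn that Xi looks like a distance matrix
    Z_direct = linkage(Xi, method='complete')

# Both routes are expected to yield the same linkage matrix.
print(np.allclose(Z_direct, Z_pdist))
```

If the two matrices agree, the labels in the question come from distances between rows of Xi, not from the values stored in Xi.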
    

    You set 9 as the threshold, so draw that line on the dendrogram:

    dendrogram(Z)
    plt.axhline(9, color='k', ls='--');
    


    The line at 9 crosses three branches, so you have 3 clusters:

    fcluster(Z, 9, criterion='distance')
    
    array([3, 1, 1, 2, 1], dtype=int32)
    #      0  1  2  3  4   <- elements
    

    This matches the dendrogram:

    • elements 1, 2 and 4 are in cluster 1
    • element 3 is in cluster 2
    • element 0 is in cluster 3
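
The same conclusion can be read from the linkage matrix without a plot. The sketch below assumes the pdist-based linkage used in this answer; the third column of Z holds the merge heights, and with criterion='distance' every merge above the threshold stays uncut, so the number of flat clusters is one plus the number of heights greater than the threshold. The variable names heights and n_clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

Xi = np.array([[0, 5, 10, 8, 3],
               [5, 0, 1, 3, 2],
               [10, 1, 0, 5, 1],
               [8, 3, 5, 0, 6],
               [3, 2, 1, 6, 0]])

Z = linkage(pdist(Xi), method='complete')

# Column 2 of the linkage matrix is the distance at which each merge happens.
heights = Z[:, 2]
print(heights)

for t in (9, 12):
    # Merges higher than t are not applied, leaving 1 + (number of such merges) clusters.
    n_clusters = 1 + int(np.sum(heights > t))
    labels = fcluster(Z, t, criterion='distance')
    print(t, n_clusters, labels)
```

Two of the four merges happen above 9 and only one above 12, which is why the first threshold gives three clusters and the second gives two.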

    If you want only two clusters, you have to choose a higher threshold, for example 12:

    dendrogram(Z)
    plt.axhline(12, color='k', ls='--');
    


    and you get the result you expected:

    fcluster(Z, 12, criterion='distance')
    
    array([2, 1, 1, 1, 1], dtype=int32)
    #      0  1  2  3  4   <- elements
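
An alternative worth considering, assuming Xi really holds precomputed pairwise distances as the question states: linkage expects such distances in condensed (1-D) form, and squareform converts a symmetric matrix with a zero diagonal into that form. This is a sketch of that route, not a claim about what the asker intended; Z_pre and labels_pre are illustrative names. With this input, complete linkage merges at heights 1, 2, 6 and 10, so a threshold of 9 already separates element 0 from the other four.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

Xi = np.array([[0, 5, 10, 8, 3],
               [5, 0, 1, 3, 2],
               [10, 1, 0, 5, 1],
               [8, 3, 5, 0, 6],
               [3, 2, 1, 6, 0]])

# Convert the square distance matrix to the condensed vector linkage expects.
condensed = squareform(Xi)

# Cluster on the entries of Xi themselves rather than on distances between rows.
Z_pre = linkage(condensed, method='complete')
print(Z_pre[:, 2])  # merge heights

labels_pre = fcluster(Z_pre, 9, criterion='distance')
print(labels_pre)   # two flat clusters: element 0 alone, elements 1-4 together
```

The difference from the pdist route is only in the input: here the values in Xi are used as the distances, which is what the heatmap of Xi suggests.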