
hierarchical clustering with single linkage in python dataframe


I have a pandas dataframe of pairwise distances that I want to use for hierarchical clustering.

    A   B   C
A   0   1   3
B   1   0   2
C   3   2   0

The code I tried:

from scipy.cluster.hierarchy import linkage, dendrogram
z = linkage(df, 'single')
dn = dendrogram(z, labels=index)

Then I got a strange result: A and B are merged at distance 1.73 (it should be 1), and then A, B and C are merged at distance 3.46 (it should be 2).


Solution

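  • The default distance used in scipy.cluster.hierarchy.linkage is the Euclidean distance, defined as d(x,y) = \sqrt{\sum_i (x_i - y_i)^2} (see the metric parameter in the linkage documentation). I think the confusion comes from expecting the entries of the dataframe to be used as distances directly: when linkage is given a 2-D array, it treats each row as an observation vector and computes the distances between the rows itself.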

    So in your case the rows are A = (0, 1, 3) and B = (1, 0, 2), and d(A,B) = \sqrt{(0-1)^2 + (1-0)^2 + (3-2)^2} = \sqrt{3} ≈ 1.73.

    Then, since your linkage is single, the distance between (A,B) and C is the minimum of d(A,C) = \sqrt{19} ≈ 4.36 and d(B,C) = \sqrt{12} ≈ 3.46, which is d(B,C) ≈ 3.46.
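
    A minimal sketch of the difference, assuming the dataframe holds precomputed pairwise distances. The variable names (row_dists, condensed, z_rows, z_dist) are illustrative and not from the question. linkage accepts precomputed distances only in condensed (1-D) form, which scipy.spatial.distance.squareform produces from a square symmetric matrix:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

labels = ["A", "B", "C"]
df = pd.DataFrame(
    [[0, 1, 3], [1, 0, 2], [3, 2, 0]],
    index=labels, columns=labels, dtype=float,
)

# Euclidean distances between the ROWS of df. This is what linkage
# computes internally when it is handed the 2-D dataframe.
row_dists = pdist(df.values)           # order: (A,B), (A,C), (B,C)
z_rows = linkage(row_dists, "single")
print(z_rows[:, 2])                    # merge heights: about 1.73 and 3.46

# Treat the entries of df as the distances themselves by converting the
# square matrix to condensed form before calling linkage.
condensed = squareform(df.values)      # [d(A,B), d(A,C), d(B,C)] = [1, 3, 2]
z_dist = linkage(condensed, "single")
print(z_dist[:, 2])                    # merge heights: 1 and 2
```

    Passing z_dist to dendrogram(z_dist, labels=df.index.tolist()) should then draw A and B joining at height 1 and C joining them at height 2, which is the result the question expected.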