I have a pandas DataFrame that I want to use for hierarchical clustering:
   A  B  C
A  0  1  3
B  1  0  2
C  3  2  0
The code I tried:
from scipy.cluster.hierarchy import linkage, dendrogram
z = linkage(df, 'single')
dn = dendrogram(z, labels=index)
But I got a strange result: A and B are merged at distance 1.73 (I expected 1), and then A, B and C are merged at distance 3.46 (I expected 2).
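The default distance used by scipy.cluster.hierarchy.linkage is the Euclidean distance, defined as d(x,y) = \sqrt{\sum_i (x_i - y_i)^2} (you can check it in the documentation for linkage). When you pass a 2-D array, each row is treated as an observation vector, and the distances are computed between rows. I think you got confused because you were taking the average of the squared differences (i.e. computing the root mean squared error), which happens to give 1 and 2 here.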
So in your case, with rows A = (0, 1, 3) and B = (1, 0, 2), d(A,B) = \sqrt{(0-1)^2 + (1-0)^2 + (3-2)^2} = \sqrt{3} ≈ 1.73.
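To check this arithmetic, here is a small sketch that computes the pairwise Euclidean distances between the rows with scipy.spatial.distance.pdist. It assumes the DataFrame is built exactly as in the question; the variable names (`labels`, `row_dist`) are only illustrative.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

labels = ["A", "B", "C"]
# Assumption: this reproduces the DataFrame from the question
df = pd.DataFrame([[0, 1, 3], [1, 0, 2], [3, 2, 0]],
                  index=labels, columns=labels)

# pdist treats each row as a point in 3-D space and returns the
# condensed pairwise distances; squareform expands them to a square matrix
row_dist = pd.DataFrame(squareform(pdist(df.values, metric="euclidean")),
                        index=labels, columns=labels)
print(row_dist.round(2))
```

The off-diagonal entries are \sqrt{3}, \sqrt{19} and \sqrt{12}, not the 1, 3 and 2 from the original table.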
Then, since your linkage is single, the distance between (A,B) and C is the minimum of d(A,C) = \sqrt{19} ≈ 4.36 and d(B,C) = \sqrt{12} ≈ 3.46, which is d(B,C) = \sqrt{12} ≈ 3.46.
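If the table in the question is actually meant to be a precomputed distance matrix (an assumption on my part, based on the expected values 1 and 2), one option is to convert it to condensed form with scipy.spatial.distance.squareform before calling linkage. The sketch below compares both interpretations; `z_rows` and `z_pre` are just illustrative names.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C"]
# Assumption: this reproduces the DataFrame from the question
df = pd.DataFrame([[0, 1, 3], [1, 0, 2], [3, 2, 0]],
                  index=labels, columns=labels)

# Interpretation 1: rows are observations, so linkage computes
# Euclidean distances between them (what the question's code does)
z_rows = linkage(df.values, method="single")

# Interpretation 2: the table is already a distance matrix, so pass
# its condensed form [d(A,B), d(A,C), d(B,C)] = [1, 3, 2]
condensed = squareform(df.values.astype(float))
z_pre = linkage(condensed, method="single")

# Column 2 of the linkage matrix holds the merge distances
print(z_rows[:, 2])
print(z_pre[:, 2])
```

With the condensed input, the merge distances are 1 and 2, as expected in the question; with the raw table they are \sqrt{3} and \sqrt{12}.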