
Feature extraction from dendrogram in R

Say you have the following matrix:

    V1  V2  V3  V4  V5
1   0   0   0   0   1
2   0   0   0   1   1
3   0   0   1   1   1
4   0   0   1   1   0
5   1   0   0   0   0
6   1   1   1   0   0
7   0   1   1   0   0
8   0   1   1   0   0
9   0   1   1   1   0
10  1   1   1   0   1

and you build a dendrogram from it. Any method will do, but here is what I did, where cmat is the matrix above:

distance <- dist(cmat, method="euclidean")
cluster <- hclust(distance, method="average")
plot(cluster, hang=-1)

[Figure: Cluster Dendrogram on Custom Matrix "cmat"]

Basically, I want to know which features cause which breaks. Say we cut the tree at a height of 1.5; we can view the part above the cut with this code:

dnd <- as.dendrogram(cluster)
plot(cut(dnd, h=1.5)$upper, main="Upper tree of cut at h=1.5")

which produces:

[Figure: Upper tree cut of dendrogram at h=1.5]

But notice how the branches get arbitrary names ("batch" ...). I want to know:

Which of the 5 features causes that first break? Then the next? Any ideas on how to code this? Thanks!


  • Short answer: all of them. Euclidean distance is defined as sqrt(sum((x - y)^2)), so every feature contributes to every distance. Hierarchical clustering is NOT a decision tree that splits on a single feature at each break.
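
    As a small check of that formula, the sketch below compares a hand-computed distance with the result of dist() for rows 1 and 6 of the question's matrix. The variable names (x, y, sq_diff, manual, builtin) are illustrative only and are not part of the original post.

    ```r
    # Rows 1 and 6 of the question's matrix (features V1..V5)
    x <- c(V1 = 0, V2 = 0, V3 = 0, V4 = 0, V5 = 1)
    y <- c(V1 = 1, V2 = 1, V3 = 1, V4 = 0, V5 = 0)

    # Squared difference per feature: each feature where the rows
    # disagree adds to the distance, so no single feature "owns" it
    sq_diff <- (x - y)^2
    manual <- sqrt(sum(sq_diff))

    # dist() with method = "euclidean" computes the same quantity
    builtin <- as.numeric(dist(rbind(x, y), method = "euclidean"))

    print(sq_diff)
    print(all.equal(manual, builtin))
    ```

    Printing sq_diff shows how much each feature adds to this one pairwise distance, which is about as close as the dendrogram itself gets to a per-feature explanation.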

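    If you want a simple explanation of the kind a decision tree gives, I suggest that you:

    1. Produce a flat clustering by cutting the tree at the desired height (for example with cutree).

    2. Train a decision tree on the resulting cluster labels, using the original features as predictors.

    3. Analyze the decision tree to see which features separate the clusters.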
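
    The three steps above could be sketched roughly as follows. This is only one way to do it: it assumes the rpart package is available (it normally ships with R as a recommended package), and the names groups, df, and fit are made up for this example. The control settings are also an assumption, needed because rpart's defaults will not split a data set with only 10 rows.

    ```r
    library(rpart)  # assumption: rpart is installed

    # The matrix from the question
    cmat <- matrix(c(
      0, 0, 0, 0, 1,
      0, 0, 0, 1, 1,
      0, 0, 1, 1, 1,
      0, 0, 1, 1, 0,
      1, 0, 0, 0, 0,
      1, 1, 1, 0, 0,
      0, 1, 1, 0, 0,
      0, 1, 1, 0, 0,
      0, 1, 1, 1, 0,
      1, 1, 1, 0, 1
    ), ncol = 5, byrow = TRUE,
    dimnames = list(as.character(1:10), paste0("V", 1:5)))

    # Same clustering as in the question
    cluster <- hclust(dist(cmat, method = "euclidean"), method = "average")

    # Step 1: flat clustering, cutting the tree at height 1.5
    groups <- factor(cutree(cluster, h = 1.5))

    # Step 2: decision tree predicting the cluster label from the features
    df <- data.frame(cmat, group = groups)
    fit <- rpart(group ~ ., data = df, method = "class",
                 control = rpart.control(minsplit = 2, minbucket = 1,
                                         cp = 0, xval = 0))

    # Step 3: inspect the splits and the variable importance
    print(fit)
    print(fit$variable.importance)
    ```

    The splits printed for fit describe the flat clusters in terms of single features. Keep in mind that they explain the cluster labels after the fact; they are not what hclust actually used to build the dendrogram.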