Say you have the following matrix:
   V1 V2 V3 V4 V5
1   0  0  0  0  1
2   0  0  0  1  1
3   0  0  1  1  1
4   0  0  1  1  0
5   1  0  0  0  0
6   1  1  1  0  0
7   0  1  1  0  0
8   0  1  1  0  0
9   0  1  1  1  0
10  1  1  1  0  1
Then you build a dendrogram from it in whatever way you like. Here is what I did, where cmat is the matrix above:
distance <- dist(cmat, method="euclidean")
cluster <- hclust(distance, method="average")
plot(cluster, hang=-1)
Basically, I want to know which features cause which breaks. Say we cut the tree at height 1.5; we can view the part above the cut with this code:
dnd <- as.dendrogram(cluster)
plot(cut(dnd, h=1.5)$upper, main="Upper tree of cut at h=1.5")
This produces a plot of the upper part of the tree.
But notice how the branches get arbitrary names like "Branch 1". What I want to know is:
Which of the 5 features causes that first break? Then the next? Any ideas on how to code this? Thanks!
Short answer: all of them. Euclidean distance is defined as sqrt(sum((x - y)^2)), so every feature enters the computation of every distance. This is NOT a decision tree that splits on a single feature.
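To make this concrete, here is a small sketch that recomputes one distance by hand. The matrix is retyped from the question; the choice of rows 1 and 6 and the names x, y, sq_diff, manual and from_dist are only for illustration.

# Rebuild the matrix from the question (cmat is the question's own name).
cmat <- matrix(c(0, 0, 0, 0, 1,
                 0, 0, 0, 1, 1,
                 0, 0, 1, 1, 1,
                 0, 0, 1, 1, 0,
                 1, 0, 0, 0, 0,
                 1, 1, 1, 0, 0,
                 0, 1, 1, 0, 0,
                 0, 1, 1, 0, 0,
                 0, 1, 1, 1, 0,
                 1, 1, 1, 0, 1),
               ncol = 5, byrow = TRUE,
               dimnames = list(as.character(1:10), paste0("V", 1:5)))

# Two rows picked arbitrarily for the illustration.
x <- cmat["1", ]
y <- cmat["6", ]

# One squared difference per feature: every column has a term in the sum,
# even if that term happens to be 0 for this particular pair.
sq_diff <- (x - y)^2
manual <- sqrt(sum(sq_diff))

# The same pair as computed by dist().
from_dist <- as.matrix(dist(cmat, method = "euclidean"))["1", "6"]

print(sq_diff)
print(c(manual = manual, dist = from_dist))

The two numbers agree, and sq_diff shows how much each feature contributes to that one distance. With average linkage, each merge height is an average over many such pairwise distances, so no single feature is responsible for a break.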
If you want a simple explanation like the one a decision tree gives, I suggest that you:
1. Produce a flat clustering by cutting the tree at the desired height.
2. Train a decision tree on the resulting clusters.
3. Analyze the decision tree.
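The three steps above could be sketched roughly as follows. This assumes the rpart package is available (it ships with most R installations); using rpart at all, the control settings, and the names groups, df and fit are my own choices, not something the question requires. The settings are loosened only because 10 rows are too few for rpart's defaults to split.

library(rpart)  # assumption: rpart is installed

# Rebuild the matrix and the clustering from the question.
cmat <- matrix(c(0, 0, 0, 0, 1,
                 0, 0, 0, 1, 1,
                 0, 0, 1, 1, 1,
                 0, 0, 1, 1, 0,
                 1, 0, 0, 0, 0,
                 1, 1, 1, 0, 0,
                 0, 1, 1, 0, 0,
                 0, 1, 1, 0, 0,
                 0, 1, 1, 1, 0,
                 1, 1, 1, 0, 1),
               ncol = 5, byrow = TRUE,
               dimnames = list(as.character(1:10), paste0("V", 1:5)))
cluster <- hclust(dist(cmat, method = "euclidean"), method = "average")

# Step 1: flat clustering, cutting at the height used in the question.
groups <- cutree(cluster, h = 1.5)

# Step 2: decision tree that predicts the cluster label from the features.
df <- data.frame(cmat, group = factor(groups))
fit <- rpart(group ~ ., data = df, method = "class",
             control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

# Step 3: inspect which features the tree splits on.
print(table(groups))
print(fit)
print(fit$variable.importance)

Keep in mind that the fitted tree only describes the flat clusters after the fact. It does not show how hclust formed them, and a different cut height can give a different tree.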