I have a fairly large dataset of about 75,000 observations and 7 columns of alarm data, which stats::hclust cannot handle (it crashes RStudio). From a few searches I found Rclusterpp.hclust, which is reported to reduce the complexity and resource usage of hierarchical clustering, so I gave it a try. It takes about 5 minutes and does produce a dendrogram, but when I use cutree and specify either a height or a number of clusters, I get strange results. I see the same problem with a small sample of 38 observations, as demonstrated below. Am I doing something wrong, or is this an issue with the Rclusterpp package? (running package 3.4.1 in R 3.4.1)
Sample dataset looks like this:
dataset
# DAY COUNT LOCATION M1 M2 HOURS SOURCE
#1 238 2 222307 1 1 5437 1008
#2 238 1 222307 2 1 5437 1008
#3 238 5 222307 3 2 5437 1008
#4 238 2 222307 4 3 5437 1008
#5 238 14 222307 5 1 5437 1008
#6 238 4 222307 5 1 5437 1008
#7 238 14 222307 6 2 5437 1008
#8 238 3 222307 1 1 5437 1008
#9 238 1 222307 2 1 5437 1008
#10 238 1 222307 4 3 5437 1008
#11 238 2 222307 4 3 5437 1008
#12 238 2 222307 4 3 5437 1008
#13 238 5 222307 5 1 5437 1008
#14 238 11 222307 5 1 5437 1008
#15 238 1 222307 5 1 5437 1008
#16 238 3 222307 5 1 5437 1008
#17 238 18 222307 6 2 5437 1008
#18 238 2 222307 7 4 5437 9
#19 238 2 222307 8 4 5437 10
#20 238 3 222307 9 5 5437 1008
#21 238 2 222307 10 6 5437 865
#22 238 9 222307 11 7 5437 10
#23 238 2 222307 12 7 5437 10
#24 238 1 222307 12 7 5437 10
#25 238 5 222307 11 7 5437 10
#26 238 2 222307 8 4 5437 10
#27 238 3 222307 13 8 5437 864
#28 238 3 222307 14 8 5437 864
#29 238 1 222307 11 7 5437 10
#30 238 3 222307 11 7 5437 10
#31 238 2 222307 15 7 5437 10
#32 238 5 222307 11 7 5437 10
#33 238 2 222307 16 7 5437 10
#34 238 2 222307 17 7 5437 10
#35 238 3 222307 18 7 5437 10
#36 238 2 222307 15 7 5437 10
#37 238 6 222307 11 7 5437 10
#38 238 3 222307 19 7 5437 10
DAY, HOURS and COUNT are real numeric values, whereas LOCATION, M1, M2 and SOURCE are numerically coded categorical values.
Using stats::hclust I get a clustering that represents the data well and distinguishes the 2 primary clusters of alarm events among the observations in this sample, as expected (i.e. the observation numbers grouped together in the dendrogram are alarms that should be grouped together):
d1 <- dist(as.matrix(scale(dataset)))
hc1 <- hclust(d1, method = "single")
cutree(hc1, 2)
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
#  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  1  1  2  2  2  2  2  1  1  2  2  2  2  2  2  2  2  2  2
plot(hc1)
However, if I do the same with Rclusterpp.hclust, I get more clusters than I specify (in this small sample I got 3 when I asked for 2). When I run this on my large dataset, I get almost 20,000 clusters when asking for only a few.
library(Rclusterpp)
d2 <- dist(as.matrix(scale(dataset)))
hc2 <- Rclusterpp.hclust(d2, method = "single")
cutree(hc2, 2)
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
#  1  1  1  1  1  1  1  1  1  1  2  2  1  1  1  1  1  3  3  1  1  3  3  3  3  3  1  1  3  3  3  3  3  3  3  3  3  3
plot(hc2)
Any idea why this is happening? Thanks.
I have looked into this a little, and it appears that the return value of Rclusterpp.hclust is not fully aligned (with respect to the merge matrix) with that of stats::hclust.
From the documentation of hclust, the merge component of the returned list is:
an n-1 by 2 matrix. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.
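To make that description concrete, here is a minimal sketch on toy data (the vector x is made up for illustration and has nothing to do with the alarm dataset). It only prints the merge and height components of a stats::hclust result so the sign convention can be read off directly.

```r
# Toy data (an assumption for illustration): five points on a line
x <- c(0, 1, 3, 4.5, 10)
hc <- hclust(dist(x), method = "single")

# Each row is one agglomeration step: negative entries are single
# observations, positive entries are row numbers of clusters formed
# at previous steps
hc$merge

# Heights at which the steps happened, in non-decreasing order
hc$height
```

Every positive entry printed by hc$merge should be smaller than the number of the row it sits in, which is the property the quoted documentation describes.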
For the C implementation of cutree, it seems that the word in parentheses (earlier) is important.
Looking at head(hc2$merge), we see the following:
     [,1] [,2]
[1,]   -2   -9
[2,]  -25  -32
[3,]  -31  -36
[4,]  -19  -26
[5,]   -4    6
[6,]  -11  -12
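So the fifth row contains a "pointer" to the sixth step, i.e. to a cluster that has not been formed yet at that point. This forward reference goes against the "(earlier)" requirement in the documentation.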
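A quick way to look for such rows is sketched below. forward_refs is a hypothetical helper name of my own, not part of stats or Rclusterpp; it simply flags rows whose positive entries are not smaller than the row number.

```r
# Hypothetical helper: which rows of a merge matrix reference a cluster
# formed at the same or a later step?
forward_refs <- function(merge) {
  step <- row(merge)                 # step (row) index of every entry
  bad  <- merge > 0 & merge >= step  # positive entries must be < their row
  which(rowSums(bad) > 0)
}

# The six rows shown above, typed in by hand
m <- rbind(c(-2, -9), c(-25, -32), c(-31, -36),
           c(-19, -26), c(-4, 6), c(-11, -12))
forward_refs(m)
# 5

# A stats::hclust result on toy data should have no such rows
hc_ok <- hclust(dist(c(0, 1, 3, 4.5, 10)), method = "single")
forward_refs(hc_ok$merge)
```

On the full object, forward_refs(hc2$merge) would list every offending step, not just the one visible in head().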
If we instead rearrange the merge component (swapping the rows and updating the "pointers"), things look OK:
# non-generic replacements for specific data example
hc3 <- hc2
hc3$merge[5, ] <- c(-11,-12)
hc3$merge[6, ] <- c(-4,5)
hc3$merge[13, ] <- c(-10,6)
cutree(hc3, 2)
You could write a function to handle this restructuring of the merge matrix, so that things always work as you would like (maybe as a wrapper around cutree).
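A rough sketch of what such a function could look like follows. It is an assumption on my part, not something provided by either package: reorder_merge, new_pos and order_out are made-up names, and the approach is simply to renumber the steps so that every cluster is formed before it is referenced, which is the same thing the manual swap above does. The heights are permuted along with the rows; that keeps them sorted when the out-of-order steps are ties, and the function warns otherwise. I have only reasoned it through on small examples, so please check it against your own data.

```r
# Hypothetical helper: renumber the steps of an hclust-like object so that
# positive entries in $merge only point to earlier rows.
reorder_merge <- function(hc) {
  m <- hc$merge
  n <- nrow(m)
  new_pos   <- integer(n)  # new_pos[old_row] = new row number (0 = not placed)
  order_out <- integer(n)  # order_out[new_row] = old row number
  k <- 0L
  for (i in seq_len(n)) {
    if (new_pos[i] > 0L) next
    stack <- i             # explicit stack instead of recursion
    while (length(stack) > 0L) {
      top  <- stack[length(stack)]
      kids <- m[top, ]
      # referenced clusters that have not been placed yet
      pending <- kids[kids > 0 & new_pos[pmax(kids, 1)] == 0L]
      if (length(pending) > 0L) {
        stack <- c(stack, pending)
      } else {
        stack <- stack[-length(stack)]
        if (new_pos[top] == 0L) {
          k <- k + 1L
          new_pos[top] <- k
          order_out[k] <- top
        }
      }
    }
  }
  new_merge <- m[order_out, , drop = FALSE]
  pos <- new_merge > 0
  new_merge[pos] <- new_pos[new_merge[pos]]  # rewrite the "pointers"
  hc$merge  <- new_merge
  hc$height <- hc$height[order_out]
  if (is.unsorted(hc$height)) warning("heights are no longer sorted")
  hc
}

# Self-contained check on toy data: scramble a stats::hclust result by hand
# so that row 2 points forward to row 3, then repair it.
ref <- hclust(dist(c(0, 1, 3, 4.5, 10)), method = "single")
broken <- ref
broken$merge  <- rbind(c(-1L, -2L), c(1L, 3L), c(-3L, -4L), c(-5L, 2L))
broken$height <- ref$height[c(1, 3, 2, 4)]
fixed <- reorder_merge(broken)
identical(cutree(fixed, k = 2), cutree(ref, k = 2))
```

For the question's data the call would then be cutree(reorder_merge(hc2), 2).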
Finally, note that there is an issue on GitHub about this, where you can find some discussion and a cross-package comparison:
https://github.com/nolanlab/Rclusterpp/issues/4