r cluster-analysis hierarchical-clustering

Rclusterpp.hclust not providing correct clusters when using cutree

I have a fairly large dataset of alarm data (about 75,000 observations and 7 columns) that stats::hclust cannot handle (it crashes RStudio). After a few searches I found Rclusterpp.hclust, which is reported to reduce the time and memory needed for hierarchical clustering, so I gave it a try. It takes about 5 minutes and does produce a dendrogram, but when I use cutree and specify either a height or a number of clusters, I get strange results. I see the same problem with a small sample of 38 observations, as demonstrated below. Am I doing something wrong, or is this an issue with the Rclusterpp package? (running package 3.4.1 in R 3.4.1)

Sample dataset looks like this:

dataset
#   DAY COUNT LOCATION M1 M2 HOURS SOURCE
#1  238     2   222307  1  1  5437   1008
#2  238     1   222307  2  1  5437   1008
#3  238     5   222307  3  2  5437   1008
#4  238     2   222307  4  3  5437   1008
#5  238    14   222307  5  1  5437   1008
#6  238     4   222307  5  1  5437   1008
#7  238    14   222307  6  2  5437   1008
#8  238     3   222307  1  1  5437   1008
#9  238     1   222307  2  1  5437   1008
#10 238     1   222307  4  3  5437   1008
#11 238     2   222307  4  3  5437   1008
#12 238     2   222307  4  3  5437   1008
#13 238     5   222307  5  1  5437   1008
#14 238    11   222307  5  1  5437   1008
#15 238     1   222307  5  1  5437   1008
#16 238     3   222307  5  1  5437   1008
#17 238    18   222307  6  2  5437   1008
#18 238     2   222307  7  4  5437      9
#19 238     2   222307  8  4  5437     10
#20 238     3   222307  9  5  5437   1008
#21 238     2   222307 10  6  5437    865
#22 238     9   222307 11  7  5437     10
#23 238     2   222307 12  7  5437     10
#24 238     1   222307 12  7  5437     10
#25 238     5   222307 11  7  5437     10
#26 238     2   222307  8  4  5437     10
#27 238     3   222307 13  8  5437    864
#28 238     3   222307 14  8  5437    864
#29 238     1   222307 11  7  5437     10
#30 238     3   222307 11  7  5437     10
#31 238     2   222307 15  7  5437     10
#32 238     5   222307 11  7  5437     10
#33 238     2   222307 16  7  5437     10
#34 238     2   222307 17  7  5437     10
#35 238     3   222307 18  7  5437     10
#36 238     2   222307 15  7  5437     10
#37 238     6   222307 11  7  5437     10
#38 238     3   222307 19  7  5437     10

DAY, HOURS and COUNT are real numeric values, whereas LOCATION, M1, M2 and SOURCE are numerically coded categorical values.

Using stats::hclust I get a clustering that represents the data well and distinguishes the 2 primary clusters of alarm events in this sample, as expected (i.e. the observations grouped together in the dendrogram are alarms that should be grouped together):

d1 <- dist((as.matrix(scale(dataset))))
hc1 <- hclust(d1, method = "single")
cutree(hc1,2)
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
# 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  1  1  2  2  2  2  2  1  1  2  2  2  2  2  2  2  2  2  2
plot(hc1)

However, if I do the same with Rclusterpp.hclust, I get more clusters than I asked for (in this small sample, 3 when I asked for 2). When I run this on my large dataset, I get almost 20,000 clusters when asking for only a few.

library(Rclusterpp)
d2 <- dist((as.matrix(scale(dataset))))
hc2 <- Rclusterpp.hclust(d2, method = "single")
cutree(hc2,2)
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
# 1  1  1  1  1  1  1  1  1  1  2  2  1  1  1  1  1  3  3  1  1  3  3  3  3  3  1  1  3  3  3  3  3  3  3  3  3  3
plot(hc2)

Any idea why this is happening? Thanks.

Solution

I have looked into this a little, and it appears that the return value of Rclusterpp.hclust is not fully consistent with that of stats::hclust with respect to the merge matrix.

From the documentation of hclust, the merge component of the returned list is:

an n-1 by 2 matrix. Row i of merge describes the merging of clusters at step i of the clustering. If an element j in the row is negative, then observation -j was merged at this stage. If j is positive then the merge was with the cluster formed at the (earlier) stage j of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons.

The C implementation of cutree seems to rely on the word in parentheses, (earlier): a positive entry is expected to refer to a step that has already happened.

Looking at head(hc2$merge), we see the following:

     [,1] [,2]
[1,]   -2   -9
[2,]  -25  -32
[3,]  -31  -36
[4,]  -19  -26
[5,]   -4    6
[6,]  -11  -12

So the fifth row contains a "pointer" to the sixth step, i.e. to a cluster that has not been formed yet, which is the unexpected direction.
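To find such rows without reading the matrix by eye, a small check along the following lines may help. This is only a sketch: forward_refs is a hypothetical helper name (not part of stats or Rclusterpp), and the matrix m below simply retypes the six rows printed above rather than calling Rclusterpp.

```r
# Return the rows of a merge matrix in which a positive entry refers to
# the same or a later step, which the hclust documentation rules out.
forward_refs <- function(merge) {
  steps <- row(merge)                  # step (row) index of every entry
  bad <- merge > 0 & merge >= steps    # positive entry not pointing backwards
  which(rowSums(bad) > 0)
}

# The six rows shown by head(hc2$merge), typed in by hand for illustration.
m <- matrix(c( -2,  -9,
              -25, -32,
              -31, -36,
              -19, -26,
               -4,   6,
              -11, -12), ncol = 2, byrow = TRUE)

forward_refs(m)   # flags row 5, the one pointing at step 6
```

On a real result you would call forward_refs(hc2$merge); an empty result would mean the matrix follows the convention cutree appears to expect.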

If we instead rearrange the merge component (swapping the rows and updating the "pointers"), things look OK:

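# non-generic replacements for this specific data example
hc3 <- hc2
hc3$merge[5, ] <- c(-11, -12)  # form the cluster of observations 11 and 12 first
hc3$merge[6, ] <- c(-4, 5)     # then merge observation 4 into it (now step 5)
hc3$merge[13, ] <- c(-10, 6)   # point row 13 at the renumbered step 6
cutree(hc3, 2)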

You could write a function that restructures the merge matrix in this way, so that things always work as you would like (perhaps as a wrapper around cutree).
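A generic version of that restructuring could look like the sketch below. It is not taken from Rclusterpp or from the GitHub issue: reorder_merge is a hypothetical helper name, and it assumes that the merge matrix still describes a valid tree (every step is referenced by at most one later merge, no cycles) and that heights never decrease from a child to its parent, as with single linkage. It walks the steps in order of height and emits each step only after the steps it refers to, then renumbers the positive entries.

```r
# Reorder the rows of hc$merge so that every positive entry points to an
# earlier row. Assumes a valid tree; a cyclic merge matrix would loop forever.
reorder_merge <- function(hc) {
  m <- hc$merge
  n <- nrow(m)
  new_pos <- integer(n)   # new_pos[old row] = new row (0 while unplaced)
  old_of  <- integer(n)   # old_of[new row]  = old row
  stack <- integer(n)     # explicit stack, avoids deep recursion
  k <- 0L
  for (start in order(hc$height, seq_len(n))) {
    if (new_pos[start] > 0L) next
    top <- 1L
    stack[top] <- start
    while (top > 0L) {
      i <- stack[top]
      entries <- m[i, ]
      kids <- entries[entries > 0]
      pending <- kids[new_pos[kids] == 0L]
      if (length(pending) > 0L) {
        # children first: push them and come back to i later
        stack[top + seq_along(pending)] <- pending
        top <- top + length(pending)
      } else {
        k <- k + 1L
        new_pos[i] <- k
        old_of[k] <- i
        top <- top - 1L
      }
    }
  }
  m2 <- m[old_of, , drop = FALSE]
  pos <- m2 > 0
  m2[pos] <- new_pos[m2[pos]]   # renumber references to the new row order
  hc$merge <- m2
  hc$height <- hc$height[old_of]
  hc
}

# Hand-built toy tree with a forward reference in row 1 (illustration only):
# observations 1-3 coincide (height 0), observation 4 joins at height 5.
bad <- structure(list(
  merge  = matrix(c(-3L, 2L, -1L, -2L, -4L, 1L), ncol = 2, byrow = TRUE),
  height = c(0, 0, 5),
  order  = c(4L, 3L, 1L, 2L),
  labels = NULL,
  method = "single"
), class = "hclust")

fixed <- reorder_merge(bad)
fixed$merge
cutree(fixed, 2)   # observations 1-3 share a cluster, 4 is on its own
```

With the real data the call would be cutree(reorder_merge(hc2), 2). The leaf order in hc$order is left untouched, since only the numbering of the steps changes and not the tree itself.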

Finally, note that there is an issue on GitHub about this, where you can find some discussion and a cross-package comparison:
https://github.com/nolanlab/Rclusterpp/issues/4