I have a dataframe resulting from a grouping:
df = structure(
list(latitude = c(23.13658528, 23.49099973,
23.49138397, 23.49053121, 23.45386579, 23.44084801,
23.4463272, 23.78719913, 23.4552457, 23.43415363,
23.45229713, 23.44412288, 23.439268),
longitude = c(-11.64711391, -11.67840419, -11.68223218,
-11.68325966, -11.94486056, -11.93216568, -11.9466947,
-11.86433498, -11.94421019, -11.92623012, -11.94499532,
-11.93692585, -11.927615),
cluster = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L)),
row.names = 1483:1495, class = "data.frame"
)
d = dist(df)
h = hclust(d)
cluster = cutree(h, k = 3)
df['cluster'] = cluster
I would like to continue partitioning the dataframe until I get groups with at most two elements and that preserves the hierarchy, a field like cluster father is enough. I know it's possible to do this directly, but I'd like to do it recursively, because I need to do a first grouping and then group the result. I racked my brains a lot to assemble a recursive algorithm, but so far I haven't been able to.
I researched how to work with trees in R, but I was not successful. Any solution would be valid.
Thanks in advance!
The following function will apply cutree(hclust(dist(df)))
to your data, then recursively apply it to each cluster within the last iteration. The main problem is how you represent this. Rather than creating new columns, I have chosen to append the subclusters as a character string, but this could be changed droending on your needs.
repartition <- function(dat, k = 3) {
multi_cut <- function(data, n = k) {
cluster <- cutree(hclust(dist(data[1:2])), k = n)
if(!"cluster" %in% names(data)) data$cluster <- cluster
else data$cluster <- paste(data$cluster, cluster)
do.call(rbind, lapply(split(data, cluster), function(x) {
if(nrow(x) < 3) return(x) else return(multi_cut(x, n = k))
}))
}
dat <- multi_cut(dat)
rownames(dat) <- NULL
dat$cluster <- stringr::str_pad(dat$cluster, side = "right",
width = max(nchar(dat$cluster)))
dat
}
testing it on your data we have:
repartition(df)
#> latitude longitude cluster
#> 1 23.13659 -11.64711 1
#> 2 23.49100 -11.67840 2 1 1
#> 3 23.49138 -11.68223 2 1 2
#> 4 23.49053 -11.68326 2 1 3
#> 5 23.45387 -11.94486 2 2 1 1
#> 6 23.45525 -11.94421 2 2 1 2
#> 7 23.45230 -11.94500 2 2 1 3
#> 8 23.44633 -11.94669 2 2 2
#> 9 23.44412 -11.93693 2 2 3
#> 10 23.44085 -11.93217 2 3 1
#> 11 23.43415 -11.92623 2 3 2
#> 12 23.43927 -11.92761 2 3 3
#> 13 23.78720 -11.86433 3
EDIT
To access different levels of clustering, we could use substr
. For example, plotting only the first level with ggplot:
library(ggplot2)
ggplot(repartition(df),
aes(longitude, latitude, color = substr(cluster, 1, 1))) +
geom_point()
Now plotting the second level:
ggplot(repartition(df),
aes(longitude, latitude, color = substr(cluster, 1, 3))) +
geom_point()
Created on 2022-08-29 with reprex v2.0.2