Search code examples
rtreepartitioninghierarchical-data

Recursive partitioning according to a criterion in R


I have a dataframe resulting from a grouping:

df = structure(
  list(latitude = c(23.13658528, 23.49099973,
    23.49138397, 23.49053121, 23.45386579, 23.44084801,
    23.4463272, 23.78719913, 23.4552457, 23.43415363,
    23.45229713, 23.44412288, 23.439268),
  longitude = c(-11.64711391, -11.67840419, -11.68223218,
    -11.68325966, -11.94486056, -11.93216568, -11.9466947,
    -11.86433498, -11.94421019, -11.92623012, -11.94499532,
    -11.93692585, -11.927615),
  cluster = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L)),
  row.names = 1483:1495, class = "data.frame"
)

d = dist(df)
h = hclust(d)
cluster = cutree(h, k = 3)
df['cluster'] = cluster

I would like to continue partitioning the dataframe until I get groups with at most two elements and that preserves the hierarchy, a field like cluster father is enough. I know it's possible to do this directly, but I'd like to do it recursively, because I need to do a first grouping and then group the result. I racked my brains a lot to assemble a recursive algorithm, but so far I haven't been able to.

I researched how to work with trees in R, but I was not successful. Any solution would be valid.

Thanks in advance!


Solution

  • The following function will apply cutree(hclust(dist(df))) to your data, then recursively apply it to each cluster within the last iteration. The main problem is how you represent this. Rather than creating new columns, I have chosen to append the subclusters as a character string, but this could be changed droending on your needs.

    repartition <- function(dat, k = 3) {
    
      multi_cut <- function(data, n = k) {
      
        cluster <- cutree(hclust(dist(data[1:2])), k = n)
        if(!"cluster" %in% names(data)) data$cluster <- cluster
        else data$cluster <- paste(data$cluster, cluster)
        do.call(rbind, lapply(split(data, cluster), function(x) {
          if(nrow(x) < 3) return(x) else return(multi_cut(x, n = k))
        }))
      }
      
      dat <- multi_cut(dat)
      rownames(dat) <- NULL
      dat$cluster <- stringr::str_pad(dat$cluster, side = "right", 
                                      width = max(nchar(dat$cluster)))
      dat
    }
    

    testing it on your data we have:

    repartition(df)
    #>    latitude longitude cluster
    #> 1  23.13659 -11.64711 1      
    #> 2  23.49100 -11.67840 2 1 1  
    #> 3  23.49138 -11.68223 2 1 2  
    #> 4  23.49053 -11.68326 2 1 3  
    #> 5  23.45387 -11.94486 2 2 1 1
    #> 6  23.45525 -11.94421 2 2 1 2
    #> 7  23.45230 -11.94500 2 2 1 3
    #> 8  23.44633 -11.94669 2 2 2  
    #> 9  23.44412 -11.93693 2 2 3  
    #> 10 23.44085 -11.93217 2 3 1  
    #> 11 23.43415 -11.92623 2 3 2  
    #> 12 23.43927 -11.92761 2 3 3  
    #> 13 23.78720 -11.86433 3
    

    EDIT

    To access different levels of clustering, we could use substr. For example, plotting only the first level with ggplot:

    library(ggplot2)
    
    ggplot(repartition(df), 
           aes(longitude, latitude, color = substr(cluster, 1, 1))) +
      geom_point()
    

    enter image description here

    Now plotting the second level:

    ggplot(repartition(df), 
           aes(longitude, latitude, color = substr(cluster, 1, 3))) +
      geom_point()
    

    enter image description here

    Created on 2022-08-29 with reprex v2.0.2