R - Clustering (K-means) within groups


I need help clustering my data within assigned groups...

I have the following dataframe:

# Generate data frame
set.seed(1)
df1 <- data.frame(
  start.x = sample(1:20),
  start.y = sample(1:20),
  end.x = sample(1:20),
  end.y = sample(1:20)
)

I've used K-means to group it:

# Group using K-means
groups <- kmeans(df1[,c('start.x', 'start.y', 'end.x', 'end.y')], 4)
df1$group <- as.factor(groups$cluster)

Now I want to use K-means again to cluster it within the groups I've just created and assign the results to a new column in the dataframe.

Does anyone know how to do this, or is there a shorter way to complete both steps at once?

Thanks...


Solution

  • We can split the data by the first grouping and apply kmeans to each subset separately. Choose k carefully, though: the values that work depend on how the first grouping turns out, because kmeans fails when a subset has too few rows for the requested number of clusters.

    library(dplyr)
    library(purrr)
    
    df1 %>%
      group_split(group = kmeans(.[,c('start.x', 'start.y', 'end.x', 'end.y')], 
                                 4)$cluster) %>%
       map_df(~.x %>% mutate(new_group = 
         kmeans(.x[,c('start.x', 'start.y', 'end.x', 'end.y')], 2)$cluster))
    

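    In base R, by does the split, apply, and combine steps, but unlist(by(...)) returns the labels in group order, so assigning them straight back misaligns rows unless df1 is already sorted by group. ave does the same per-group work and puts each result back at its original row:

    cols <- c('start.x', 'start.y', 'end.x', 'end.y')
    df1$new_group <- ave(seq_len(nrow(df1)), df1$group, FUN = function(i) 
            kmeans(df1[i, cols], 2)$cluster)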
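The caution about choosing k can be made concrete. The sketch below is one possible way to guard against small groups, not part of the original answer: cluster_within is a hypothetical helper name, and the fallback of putting every row of a too-small group into cluster 1 is an assumption you may want to change. It relies on ave to write each group's labels back to the rows they came from.

```r
# Hypothetical helper (an illustration, not from the original answer):
# run kmeans inside each existing group and keep the original row order.
cluster_within <- function(data, group, cols, k = 2) {
  ave(seq_len(nrow(data)), group, FUN = function(i) {
    # Assumed fallback: the default kmeans algorithm needs more rows
    # than centers, so small groups get a single cluster label.
    if (length(i) <= k) return(rep(1L, length(i)))
    kmeans(data[i, cols], centers = k)$cluster
  })
}

# Same data as in the question
set.seed(1)
df1 <- data.frame(
  start.x = sample(1:20),
  start.y = sample(1:20),
  end.x = sample(1:20),
  end.y = sample(1:20)
)
cols <- c("start.x", "start.y", "end.x", "end.y")

# First-level groups, then sub-clusters within each group
df1$group <- as.factor(kmeans(df1[, cols], 4)$cluster)
df1$new_group <- cluster_within(df1, df1$group, cols, k = 2)

# One label per (group, new_group) pair, if a single column is preferred
df1$combined <- interaction(df1$group, df1$new_group, drop = TRUE)
```

Note that new_group labels are only meaningful within a group: cluster 1 in group 1 has nothing to do with cluster 1 in group 2, which is why the combined column can be handy.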