Search code examples
rdplyrbroom

Wrangling clusters/centers of kmeans back into original data frame


Here's some data.

df <- data.frame(groupvar=rep(c('a','b'),100),v1=rnorm(200),v2=rnorm(200))

Now I do k means within each group:

require(dplyr)

kobjs = df %>% group_by(groupvar) %>%
  do(kclust = kmeans(cbind(.$v1,.$v2),centers=5))

"kobjs" looks like this:

  groupvar      kclust
    (fctr)       (chr)
1        a <S3:kmeans>
2        b <S3:kmeans>

I'd like to grab the cluster assignment (and, ideally, the center points) and append them to the original data frame. I thought you could use broom to do this:

require(broom)
merged = kobjs %>%
  group_by(groupvar) %>% do(augment(.$kclust[[1]],df))

But that somehow yields a 400X4 matrix instead of a 200X4. How did that happen? How do I get the behavior I want?

EDIT1: Solved the way I wanted it, with some insight from aosmith. There's probably a way to make it more elegant (is the left_join necessary?) but it's the behavior I want:

kobjs = df %>%
  do(kmeans(cbind(.$v1,.$v2),centers=5) %>%
       fitted(method="centers") %>% 
       data.frame(cluster=rownames(.),entry=1:length(.),row.names=NULL)) %>%
  left_join(df %>% group_by(groupvar) %>% mutate(entry=1:n()),
            by=c("entry","groupvar"))

Solution

  • At the moment you are using augment on the entire df rather than using just the subset for each group. This is why you are getting a dataset that's twice as long as you expected.

    So you'd need to do something like the following using kobjs. I set the seed to 16 before making kobjs.

    kobjs %>%
        group_by(groupvar) %>% 
        do(augment(.$kclust[[1]], df[df$groupvar == .$groupvar,]))
    
    Source: local data frame [200 x 5]
    Groups: groupvar [2]
    
       .rownames groupvar          v1         v2 .cluster
           (chr)   (fctr)       (dbl)      (dbl)   (fctr)
    1          1        a  0.30291472  0.2203811        1
    2          3        a -0.51381305  0.1480162        1
    3          5        a -0.75246517 -0.6407782        2
    4          7        a  0.06453416  1.2965984        3
    5          9        a -0.62353541 -1.3240648        2
    6         11        a  0.18435121 -1.0513837        5
    7         13        a -0.26481666  2.8117979        4
    8         15        a  0.56643441  0.1434451        1
    9         17        a -0.30406035 -0.1477244        1
    10        19        a  1.62538120 -0.5972593        5
    ..       ...      ...         ...        ...      ...
    

    To get something more like what you want.

    You do have other options. For example, you could use augment in the original do step:

    set.seed(16)
    df %>% group_by(groupvar) %>%
        do(augment(kmeans(cbind(.$v1,.$v2),centers=5), .))
    
    Source: local data frame [200 x 4]
    Groups: groupvar [2]
    
       groupvar          v1         v2 .cluster
         (fctr)       (dbl)      (dbl)   (fctr)
    1         a  0.30291472  0.2203811        1
    2         a -0.51381305  0.1480162        1
    3         a -0.75246517 -0.6407782        2
    4         a  0.06453416  1.2965984        3
    5         a -0.62353541 -1.3240648        2
    6         a  0.18435121 -1.0513837        5
    7         a -0.26481666  2.8117979        4
    8         a  0.56643441  0.1434451        1
    9         a -0.30406035 -0.1477244        1
    10        a  1.62538120 -0.5972593        5
    ..      ...         ...        ...      ...
    

    You could also pull out the cluster from the kmeans object and add these to the dataset with the following do coding. This doesn't use broom, though.

    set.seed(16)
    df %>% group_by(groupvar) %>%
        do(data.frame(., kclust = kmeans(cbind(.$v1,.$v2),centers=5)$cluster))
    
    Source: local data frame [200 x 4]
    Groups: groupvar [2]
    
       groupvar          v1         v2 kclust
         (fctr)       (dbl)      (dbl)  (int)
    1         a  0.30291472  0.2203811      1
    2         a -0.51381305  0.1480162      1
    3         a -0.75246517 -0.6407782      2
    4         a  0.06453416  1.2965984      3
    5         a -0.62353541 -1.3240648      2
    6         a  0.18435121 -1.0513837      5
    7         a -0.26481666  2.8117979      4
    8         a  0.56643441  0.1434451      1
    9         a -0.30406035 -0.1477244      1
    10        a  1.62538120 -0.5972593      5
    ..      ...         ...        ...    ...
    

    Edit to add example of saving two things from a model in a single do call.

    You can fit and name model object in do and then pull multiple summary values from it, but it involves the use of curly brackets (I'm not sure if they are included in your irrational fear of square brackets ;-) ).

    Here are two ways, first creating model, pulling out the fitted values as fit, and binding it all together with the original dataset (that's what the first . in data.frame represents).

    df %>% group_by(groupvar) %>%
        do( { 
            model = kmeans(cbind(.$v1, .$v2), centers = 5)
            fit = fitted(model, methods = "centers")
            data.frame(., fit, cluster = rownames(fit), row.names = NULL) 
        })
    

    I don't always like doing lots of naming, so the second option just works directly on model and skips the fit step.

    df %>% group_by(groupvar) %>%
        do( { 
            model = kmeans(cbind(.$v1, .$v2), centers = 5)
            data.frame(., fitted(model, methods = "centers"), cluster = model$cluster, row.names = NULL) 
        })