Here's some data.
df <- data.frame(groupvar=rep(c('a','b'),100),v1=rnorm(200),v2=rnorm(200))
Now I do k means within each group:
require(dplyr)
kobjs = df %>% group_by(groupvar) %>%
do(kclust = kmeans(cbind(.$v1,.$v2),centers=5))
"kobjs" looks like this:
groupvar kclust
(fctr) (chr)
1 a <S3:kmeans>
2 b <S3:kmeans>
I'd like to grab the cluster assignment (and, ideally, the center points) and append them to the original data frame. I thought you could use broom to do this:
require(broom)
merged = kobjs %>%
group_by(groupvar) %>% do(augment(.$kclust[[1]],df))
But that somehow yields a 400X4 matrix instead of a 200X4. How did that happen? How do I get the behavior I want?
EDIT1: Solved the way I wanted it, with some insight from aosmith. There's probably a way to make it more elegant (is the left_join necessary?) but it's the behavior I want:
kobjs = df %>%
do(kmeans(cbind(.$v1,.$v2),centers=5) %>%
fitted(method="centers") %>%
data.frame(cluster=rownames(.),entry=1:length(.),row.names=NULL)) %>%
left_join(df %>% group_by(groupvar) %>% mutate(entry=1:n()),
by=c("entry","groupvar"))
At the moment you are using augment
on the entire df
rather than using just the subset for each group. This is why you are getting a dataset that's twice as long as you expected.
So you'd need to do something like the following using kobjs
. I set the seed to 16 before making kobjs
.
kobjs %>%
group_by(groupvar) %>%
do(augment(.$kclust[[1]], df[df$groupvar == .$groupvar,]))
Source: local data frame [200 x 5]
Groups: groupvar [2]
.rownames groupvar v1 v2 .cluster
(chr) (fctr) (dbl) (dbl) (fctr)
1 1 a 0.30291472 0.2203811 1
2 3 a -0.51381305 0.1480162 1
3 5 a -0.75246517 -0.6407782 2
4 7 a 0.06453416 1.2965984 3
5 9 a -0.62353541 -1.3240648 2
6 11 a 0.18435121 -1.0513837 5
7 13 a -0.26481666 2.8117979 4
8 15 a 0.56643441 0.1434451 1
9 17 a -0.30406035 -0.1477244 1
10 19 a 1.62538120 -0.5972593 5
.. ... ... ... ... ...
To get something more like what you want.
You do have other options. For example, you could use augment
in the original do
step:
set.seed(16)
df %>% group_by(groupvar) %>%
do(augment(kmeans(cbind(.$v1,.$v2),centers=5), .))
Source: local data frame [200 x 4]
Groups: groupvar [2]
groupvar v1 v2 .cluster
(fctr) (dbl) (dbl) (fctr)
1 a 0.30291472 0.2203811 1
2 a -0.51381305 0.1480162 1
3 a -0.75246517 -0.6407782 2
4 a 0.06453416 1.2965984 3
5 a -0.62353541 -1.3240648 2
6 a 0.18435121 -1.0513837 5
7 a -0.26481666 2.8117979 4
8 a 0.56643441 0.1434451 1
9 a -0.30406035 -0.1477244 1
10 a 1.62538120 -0.5972593 5
.. ... ... ... ...
You could also pull out the cluster
from the kmeans
object and add these to the dataset with the following do
coding. This doesn't use broom, though.
set.seed(16)
df %>% group_by(groupvar) %>%
do(data.frame(., kclust = kmeans(cbind(.$v1,.$v2),centers=5)$cluster))
Source: local data frame [200 x 4]
Groups: groupvar [2]
groupvar v1 v2 kclust
(fctr) (dbl) (dbl) (int)
1 a 0.30291472 0.2203811 1
2 a -0.51381305 0.1480162 1
3 a -0.75246517 -0.6407782 2
4 a 0.06453416 1.2965984 3
5 a -0.62353541 -1.3240648 2
6 a 0.18435121 -1.0513837 5
7 a -0.26481666 2.8117979 4
8 a 0.56643441 0.1434451 1
9 a -0.30406035 -0.1477244 1
10 a 1.62538120 -0.5972593 5
.. ... ... ... ...
Edit to add example of saving two things from a model in a single do
call.
You can fit and name model object in do
and then pull multiple summary values from it, but it involves the use of curly brackets (I'm not sure if they are included in your irrational fear of square brackets ;-) ).
Here are two ways, first creating model
, pulling out the fitted values as fit
, and binding it all together with the original dataset (that's what the first .
in data.frame
represents).
df %>% group_by(groupvar) %>%
do( {
model = kmeans(cbind(.$v1, .$v2), centers = 5)
fit = fitted(model, methods = "centers")
data.frame(., fit, cluster = rownames(fit), row.names = NULL)
})
I don't always like doing lots of naming, so the second option just works directly on model
and skips the fit
step.
df %>% group_by(groupvar) %>%
do( {
model = kmeans(cbind(.$v1, .$v2), centers = 5)
data.frame(., fitted(model, methods = "centers"), cluster = model$cluster, row.names = NULL)
})