Given a dataframe,
ID <- c("a","b","b","c","c","c","d","d","d")
dummy1 <- c(1,0,1,1,0,0,1,1,0)
dummy2 <- c(0,0,0,0,1,1,1,1,1)
dummy3 <- c(1,0,0,1,1,0,0,1,1)
df <- data.frame(ID,dummy1,dummy2,dummy3)
ID dummy1 dummy2 dummy3
1 a 1 0 1
2 b 0 0 0
3 b 1 0 0
4 c 1 0 1
5 c 0 1 1
6 c 0 1 0
7 d 1 1 0
8 d 1 1 1
9 d 0 1 1
I want to calculate the mean of each dummy variable within each ID group.
It would be like using tapply, aggregate, or ave(x, y, mean) on multiple columns, creating a new variable/column for each at the same time. Unfortunately, I don't know the number of dummy variables in advance. The only thing I know is that the dummy variables start in column 2. My result would look like this:
ID m_dummy1 m_dummy2 m_dummy3 m_dummy5...
a 1 0 1
b 0.5 0 0
c 0.33 0.66 0.66
d 0.66 1 0.66
or like this:
ID m_dummy1 m_dummy2 m_dummy3 m_dummy5...
a ... 1 0 1
b ... 0.5 0 0
b ... 0.5 0 0
c ... 0.33 0.66 0.66
c ... 0.33 0.66 0.66
c ... 0.33 0.66 0.66
d ... 0.66 1 0.66
d ... 0.66 1 0.66
d ... 0.66 1 0.66
In my scenario, I have an unknown number of dummies from 1 to x, so I might have dummy2 only, or I might have "dummy1" together with the fictional dummies "dummy5" and "dummy6".
The perfect solution would create an "m_dummy" column for every column from column 2 onward.
It should therefore still work if dummy3 were missing or if there were an additional dummy4:
dummy4 <- c(1,0,0,0,0,0,0,1,0)
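For reference, the tapply/aggregate/ave idea mentioned above can be extended to an unknown number of columns in base R. The following is only a sketch under the question's stated assumption that every column from column 2 onward is a dummy; the names dummy_cols, means and df_with_means are made up for illustration.

```r
# Example data from the question
ID <- c("a","b","b","c","c","c","d","d","d")
dummy1 <- c(1,0,1,1,0,0,1,1,0)
dummy2 <- c(0,0,0,0,1,1,1,1,1)
dummy3 <- c(1,0,0,1,1,0,0,1,1)
df <- data.frame(ID, dummy1, dummy2, dummy3)

# Assumption: everything except the first column is a dummy variable
dummy_cols <- names(df)[-1]

# One row per ID: aggregate() applies mean to every selected column
means <- aggregate(df[dummy_cols], by = df["ID"], FUN = mean)
names(means)[-1] <- paste0("m_", dummy_cols)

# One row per original row: ave() repeats the group mean within each ID
df_with_means <- df
df_with_means[paste0("m_", dummy_cols)] <-
  lapply(df[dummy_cols], function(x) ave(x, df$ID, FUN = mean))
```

Because the columns are selected by position rather than by name, the same code should keep working if dummy3 is dropped or a dummy4 is added.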
You could try summarise_each or mutate_each from dplyr:
library(dplyr)

# Group by ID, then take the mean of every column whose name starts with "dummy"
df %>%
  group_by(ID) %>%
  summarise_each(funs(mean), starts_with('dummy'))
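summarise_each, mutate_each and funs were deprecated in later dplyr releases. As a hedged sketch of the same idea, assuming dplyr 1.0.0 or newer, across() can do both jobs and also adds the "m_" prefix asked for in the question. The names means and df_with_means are illustrative only.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  ID     = c("a","b","b","c","c","c","d","d","d"),
  dummy1 = c(1,0,1,1,0,0,1,1,0),
  dummy2 = c(0,0,0,0,1,1,1,1,1),
  dummy3 = c(1,0,0,1,1,0,0,1,1)
)

# One row per ID. Inside a grouped data frame, everything() skips the
# grouping column, so every remaining column is averaged.
means <- df %>%
  group_by(ID) %>%
  summarise(across(everything(), mean, .names = "m_{.col}"))

# One row per original row: mutate() keeps the dummies and appends m_* columns
df_with_means <- df %>%
  group_by(ID) %>%
  mutate(across(everything(), mean, .names = "m_{.col}")) %>%
  ungroup()
```

Swapping everything() for starts_with("dummy") restricts the calculation to columns with that prefix, as in the original answer.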