r, dummy-data

Perform operations on multiple dummy variables


Given a dataframe,

ID <- c("a","b","b","c","c","c","d","d","d")
dummy1 <- c(1,0,1,1,0,0,1,1,0)
dummy2 <- c(0,0,0,0,1,1,1,1,1)
dummy3 <- c(1,0,0,1,1,0,0,1,1)
df <- data.frame(ID,dummy1,dummy2,dummy3)

  ID dummy1 dummy2 dummy3
1  a      1      0      1
2  b      0      0      0
3  b      1      0      0
4  c      1      0      1
5  c      0      1      1
6  c      0      1      0
7  d      1      1      0
8  d      1      1      1
9  d      0      1      1

I want to calculate the mean of each dummy variable within each ID group.

It would be like applying tapply, aggregate or ave(x, y, FUN = mean) to multiple columns at once, creating the new variables/columns in the same step. Unfortunately, I don't know the number of dummy variables in advance; the only thing I know is that they start in column 2. My result would look like this:

ID     m_dummy1  m_dummy2  m_dummy3   m_dummy5...
a      1         0         1
b      0.5       0         0
c      0.33      0.66      0.66
d      0.66      1         0.66

or like this:

ID     m_dummy1  m_dummy2  m_dummy3   m_dummy5...
a ...  1         0         1
b ...  0.5       0         0
b ...  0.5       0         0
c ...  0.33      0.66      0.66
c ...  0.33      0.66      0.66
c ...  0.33      0.66      0.66
d ...  0.66      1         0.66    
d ...  0.66      1         0.66
d ...  0.66      1         0.66

In my scenario, the number of dummies is unknown and their names can vary: I might have only dummy2, or I might have dummy1 together with the fictional dummies dummy5 and dummy6. The ideal solution would create an "m_dummy" column for every column from column 2 onward. It would therefore also work if dummy3 were missing or if there were an additional dummy4:

dummy4 <- c(1,0,0,0,0,0,0,1,0)
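
For reference, the base R functions named above (aggregate and ave) can be pointed at every column from the second onward without knowing how many there are. This is only a minimal sketch, assuming the example data from the top of the question; the names dummy_cols, per_id and per_row are illustrative.

```r
# Example data from the question
ID <- c("a","b","b","c","c","c","d","d","d")
dummy1 <- c(1,0,1,1,0,0,1,1,0)
dummy2 <- c(0,0,0,0,1,1,1,1,1)
dummy3 <- c(1,0,0,1,1,0,0,1,1)
df <- data.frame(ID, dummy1, dummy2, dummy3)

# Assumption from the question: every column from the second onward is a dummy
dummy_cols <- names(df)[-1]

# First layout: one row per ID holding the group means
per_id <- aggregate(df[dummy_cols], by = list(ID = df$ID), FUN = mean)
names(per_id)[-1] <- paste0("m_", dummy_cols)

# Second layout: original rows kept, group means added as m_* columns
per_row <- df
per_row[paste0("m_", dummy_cols)] <- lapply(
  df[dummy_cols],
  function(x) ave(x, df$ID, FUN = mean)  # ave() repeats each group mean within its group
)
```

Because dummy_cols is derived from names(df), adding a dummy4 column or dropping dummy3 needs no change to the code.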


Solution

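  • You could try summarise_each or mutate_each from dplyr. summarise_each returns one row per ID (your first layout); mutate_each returns the group means repeated on every original row (your second layout).

    library(dplyr)
    # mean of every column whose name starts with "dummy", one row per ID
    df %>%
        group_by(ID) %>%
        summarise_each(funs(mean), starts_with('dummy'))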
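
In recent dplyr releases, summarise_each(), mutate_each() and funs() are deprecated, and across() covers the same ground. The following is a minimal sketch of both layouts, assuming dplyr 1.0.0 or later is installed. The names per_id and per_row are only illustrative, and the m_ prefix comes from the .names argument.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  ID     = c("a","b","b","c","c","c","d","d","d"),
  dummy1 = c(1,0,1,1,0,0,1,1,0),
  dummy2 = c(0,0,0,0,1,1,1,1,1),
  dummy3 = c(1,0,0,1,1,0,0,1,1)
)

# First layout: one row per ID.
# Inside a grouped data frame, everything() skips the grouping column ID,
# so every remaining column is averaged, however many there are.
per_id <- df %>%
  group_by(ID) %>%
  summarise(across(everything(), mean, .names = "m_{.col}"))

# Second layout: keep all rows and add the m_* columns next to the originals
per_row <- df %>%
  group_by(ID) %>%
  mutate(across(everything(), mean, .names = "m_{.col}")) %>%
  ungroup()
```

Swapping everything() for starts_with("dummy") restricts the calculation to columns with that prefix, as in the original call.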