I have a list of three data frames and would like to generate another list of three data frames, each with one row per value of a grouping variable (g1) and the means of six variables by g1. The twist is that I would like to calculate the means of the three continuous variables only over rows where the corresponding dummy variable equals 1.
Reproducible example:
a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))
u[[1]]
> u
[[1]]
g1 dummy1 dummy2 dummy3 contin1 contin2 contin3
1 fj 1 0 0 199 18 61
2 fj 1 0 0 91 158 28
3 fj 1 1 0 147 67 190
4 a 1 0 1 181 105 22
5 fj 0 1 0 14 16 156
6 a 0 0 0 178 14 98
7 g 0 0 1 116 97 30
8 g 1 1 1 48 31 144
9 g 0 0 0 60 21 112
10 g 0 1 0 95 145 199
I would like to calculate the mean of contin1 only when dummy1 = 1, the mean of contin2 only when dummy2 = 1, and the mean of contin3 only when dummy3 = 1.
The output I WANT for the first data frame in the list:
> rates
[[1]]
x[, 1] V1 V2 V3 x[, 1] x[, 6] x[, 1] x[, 7] x[, 1] x[, 8]
1 a 0.50 0.0 0.5 a 181 a NA a 22
2 fj 0.75 0.5 0.0 fj 145.67 fj 41.5 fj NA
3 g 0.25 0.5 0.5 g 48 g 88 g 87
What I have tried:
rates <- lapply(u, function(x) {
cbind(aggregate(cbind(x[,2],x[,3],x[,4]) ~ x[,1], FUN = mean, na.action = NULL),
aggregate(x[,6] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,2] == 1)),
aggregate(x[,7] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,3] == 1)),
aggregate(x[,8] ~ x[,1], FUN = mean, na.action = NULL, subset = (x[,4] == 1)))
})
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 2
I understand that this error comes from cbind, which fails whenever its arguments have different numbers of rows. (The column x[, 6] has three rows whereas x[, 7] and x[, 8] have two.) I was hoping there was some way for aggregate to keep one row per value of the grouping variable, so that every piece would have the same number of rows and the cbind would work. Perhaps this is not possible, given this line in the R documentation: "Rows with missing values in any of the by variables will be omitted from the result."
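To make the mismatch concrete, here is a minimal sketch that rebuilds the first data frame from the printed output above and shows a subsetted aggregate returning fewer rows than there are groups. It also sketches one possible workaround, assuming it is acceptable to join the pieces with merge(all.x = TRUE) instead of cbind. The helper name cond_means is hypothetical and not part of any package.

```r
# First data frame, typed in from the printed u[[1]] above.
d <- data.frame(
  g1      = c("fj","fj","fj","a","fj","a","g","g","g","g"),
  dummy1  = c(1,1,1,1,0,0,0,1,0,0),
  dummy2  = c(0,0,1,0,1,0,0,1,0,1),
  dummy3  = c(0,0,0,1,0,0,1,1,0,0),
  contin1 = c(199,91,147,181,14,178,116,48,60,95),
  contin2 = c(18,158,67,105,16,14,97,31,21,145),
  contin3 = c(61,28,190,22,156,98,30,144,112,199)
)

# Group "a" has no rows with dummy2 == 1, so it is dropped here:
# the result has 2 rows instead of 3, which is what breaks cbind.
nrow(aggregate(contin2 ~ g1, data = d, FUN = mean, subset = dummy2 == 1))

# cond_means() is a hypothetical helper, not from any package.
cond_means <- function(x) {
  # Share of 1s per dummy, by group; every value of g1 appears here.
  out <- aggregate(cbind(dummy1, dummy2, dummy3) ~ g1, data = x, FUN = mean)
  for (i in 1:3) {
    nm   <- paste0("contin", i)
    keep <- x[[paste0("dummy", i)]] == 1
    if (!any(keep)) {            # aggregate() errors on zero rows
      out[[nm]] <- NA_real_
      next
    }
    part <- aggregate(x[keep, nm, drop = FALSE],
                      by = list(g1 = x$g1[keep]), FUN = mean)
    # Left join: groups missing from the subset come back as NA.
    out <- merge(out, part, by = "g1", all.x = TRUE)
  }
  out
}

cond_means(d)
```

For a list of data frames, the same helper could be applied with lapply(u, cond_means).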
I have carefully read the documentation for aggregate. The following two posts address similar issues, but neither uses different subsets of the data to calculate the means:
"R: Calculate means for subset of a group" and "Means from a list of data frames in R"
Any suggestions would be hugely appreciated.
If you have dplyr installed, the following code seems to solve your problem.
library(dplyr)
set.seed(1234)
a <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),c(1,1,1,1,0,0,0,1,0,0),c(0,0,1,0,1,0,0,1,0,1),c(0,0,0,1,0,0,1,1,0,0),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
b <- data.frame(c("fj","a","fj","a","fj","fj","fj","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
c <- data.frame(c("fj","fj","fj","a","fj","a","g","g","g","g"),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 0, max = 2)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)),floor(runif(10, min = 10, max = 200)))
u <- list(a,b,c)
u <- lapply(u, setNames, nm = c('g1','dummy1','dummy2','dummy3','contin1','contin2','contin3'))
rates <- lapply(u, function(x)
x %>%
mutate( contin1_ = ifelse(dummy1==1, contin1, NA) ) %>%
mutate( contin2_ = ifelse(dummy2==1, contin2, NA) ) %>%
mutate( contin3_ = ifelse(dummy3==1, contin3, NA) ) %>%
group_by(g1) %>%
summarize(
V1 = mean(dummy1, na.rm=TRUE),
V2 = mean(dummy2, na.rm=TRUE),
V3 = mean(dummy3, na.rm=TRUE),
mean1 = mean(contin1_, na.rm=TRUE),
mean2 = mean(contin2_, na.rm=TRUE),
mean3 = mean(contin3_, na.rm=TRUE)
)
)
print(rates[[1]])
Which gives me this:
Source: local data frame [3 x 7]
g1 V1 V2 V3 mean1 mean2 mean3
1 a 0.50 0.0 0.5 128.00000 NaN 17
2 fj 0.75 0.5 0.0 94.66667 64 NaN
3 g 0.25 0.5 0.5 54.00000 57 146
The numbers that I get seem to be approximately correct, and the missing values (printed as NaN) are in all the right places. Unfortunately, your example is not fully reproducible because you did not set a seed for the random number generator, so my runif gives me different values than yours.
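The output above prints NaN rather than NA for groups with no qualifying rows. That happens because mean() with na.rm = TRUE is left with an empty vector once all the NAs are removed. If a true NA is preferred, a small wrapper along these lines could be used inside summarize. This is only a sketch, and the name mean_or_na is hypothetical:

```r
# With na.rm = TRUE and nothing left to average, mean() returns NaN.
mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Hypothetical helper: return NA instead of NaN when a group has no
# non-missing values, otherwise the ordinary mean.
mean_or_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) NA_real_ else mean(v)
}

mean_or_na(c(NA_real_, NA_real_))
mean_or_na(c(1, NA, 3))
```

In the pipeline above it would replace the last three calls, for example mean2 = mean_or_na(contin2_).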