I have a dataframe which looks like
structure(list(treaty = c("abc", "de", "abc", "de", "de"), SEX1 = c("M",
"F", "F", "M", "F"), age = c(20, 30, 40, 50, 70)), class = "data.frame", row.names = c(NA,
-5L))
I convert this into a datatable as the size is huge and perform operations on it. i want to be able to aggregate and calculate to get the following output on the data table
Output:
treaty | Prop (M) | Prop (F) | Avg Age |
---|---|---|---|
abc | 1/2 | 1/2 | 30 |
de | 1/3 | 2/3 | 50 |
basically 1. group by treaty id 2. show proportion of M and F based on each treaty total for the datatable and average age.
Thanks in advance
Assigning your dput
structure as df
, using data.table
you can do:
dt <- data.table::data.table(df)
dt[, .(propM = mean(SEX1 == "M"),
propF = mean(SEX1 == "F"),
AveAge = mean(age)),
.(treaty)]
Output
treaty propM propF AveAge
1: abc 0.5000000 0.5000000 30
2: de 0.3333333 0.6666667 50
Or using dplyr
you could simply summarize
all of these operations by treaty
:
df %>%
summarize(propM = mean(SEX1 == "M"),
propM = mean(SEX1 == "F"),
AveAge = mean(age),
.by = treaty)
Output:
treaty propM propF AveAge
1 abc 0.5000000 0.5000000 30
2 de 0.3333333 0.6666667 50
It looks like in this case, dplyr
may be a bit faster:
n <- 1e6
set.seed(123)
test_df <- data.frame(treaty = sample(c("abc", "de"), n, replace = TRUE),
SEX1 = sample(c("M", "F"), n, replace = TRUE),
age = sample(10:50, n, replace = TRUE))
test_dt <- data.table::data.table(test_df)
microbenchmark::microbenchmark(
datatable = test_dt[, .(propM = mean(SEX1 == "M"),
propF = mean(SEX1 == "F"),
AveAge = mean(age)),
.(treaty)],
dplyr = test_df %>%
summarize(propM = mean(SEX1 == "M"),
propF = mean(SEX1 == "F"),
AveAge = mean(age),
.by = treaty)
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# datatable 52.12843 59.04811 76.12773 64.93694 77.99761 364.6247 100
# dplyr 40.47459 50.06535 62.74962 57.95659 65.37164 313.1679 100