Search code examples
rdatatable

aggregating and counting data table column based on condition


I have a dataframe which looks like

structure(list(treaty = c("abc", "de", "abc", "de", "de"), SEX1 = c("M", 
"F", "F", "M", "F"), age = c(20, 30, 40, 50, 70)), class = "data.frame", row.names = c(NA, 
-5L))

I convert this into a datatable as the size is huge and perform operations on it. i want to be able to aggregate and calculate to get the following output on the data table

Output:

treaty Prop (M) Prop (F) Avg Age
abc 1/2 1/2 30
de 1/3 2/3 50

basically 1. group by treaty id 2. show proportion of M and F based on each treaty total for the datatable and average age.

Thanks in advance


Solution

  • Assigning your dput structure as df, using data.table you can do:

    dt <- data.table::data.table(df)
    
    dt[, .(propM = mean(SEX1 == "M"), 
           propF = mean(SEX1 == "F"), 
           AveAge = mean(age)),
       .(treaty)]
    

    Output

       treaty     propM     propF AveAge
    1:    abc 0.5000000 0.5000000     30
    2:     de 0.3333333 0.6666667     50
    

    Or using dplyr you could simply summarize all of these operations by treaty:

    df %>%
      summarize(propM = mean(SEX1 == "M"),
                propM = mean(SEX1 == "F"),
                AveAge = mean(age),
                .by = treaty)
    

    Output:

      treaty     propM     propF AveAge
    1    abc 0.5000000 0.5000000     30
    2     de 0.3333333 0.6666667     50
    

    It looks like in this case, dplyr may be a bit faster:

    n <- 1e6
    set.seed(123)
    
    test_df <- data.frame(treaty = sample(c("abc", "de"), n, replace = TRUE),
                          SEX1 = sample(c("M", "F"), n, replace = TRUE),
                          age = sample(10:50, n, replace = TRUE))
    test_dt <- data.table::data.table(test_df)
    
    microbenchmark::microbenchmark(
      datatable = test_dt[, .(propM = mean(SEX1 == "M"), 
                         propF = mean(SEX1 == "F"), 
                         AveAge = mean(age)),
                     .(treaty)],
      dplyr = test_df %>%
        summarize(propM = mean(SEX1 == "M"),
                  propF = mean(SEX1 == "F"),
                  AveAge = mean(age),
                  .by = treaty)
    )
    
    #Unit: milliseconds
    #      expr      min       lq     mean   median       uq      max neval
    # datatable 52.12843 59.04811 76.12773 64.93694 77.99761 364.6247   100
    #     dplyr 40.47459 50.06535 62.74962 57.95659 65.37164 313.1679   100