r regex data.table tidyverse summarize

How to group_by and summarize multiple variables using regex?


I want to use regex to identify both the variable to group_by and the variables to summarize, so that I can summarize my data efficiently. I cannot do this one variable at a time because I have a large number of variables to summarize, and the grouping variable needs to be passed dynamically each time. data.table accepts a regex for the grouping variable, but not for the summarizing variables. My attempts so far using tidyverse have been unsuccessful as well. Any help would be much appreciated.

My data:

    tempDF <- structure(list(d1 = c("A", "B", "C", "A", "C"), d2 = c(40L, 50L, 20L, 50L, 20L), 
        d3 = c(20L, 40L, 50L, 40L, 50L), d4 = c(60L, 30L, 30L,60L, 30L), p_A = c(1L, 
        3L, 2L, 3L, 2L), p_B = c(3L, 4L, 3L, 3L, 4L), p_C = c(2L, 1L, 1L,2L, 1L), p4 = c(5L, 
        5L, 4L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L))

    View(tempDF)    
    lLevels<-c("d1")
    lContinuum<-c("p_A", "p_B", "p_C")


My attempts:

    setDT(tempDF)[ , list(group_means = mean(eval((paste0(lContinuum)))), by=eval((paste0(lLevels))))] 
       group_means by
    1:          NA d1
    Warning message:
    In mean.default(eval((paste0(lContinuum)))) :
      argument is not numeric or logical: returning NA

    # But a single variable works:
    setDT(tempDF)[ , list(group_means = mean(p_A)), by=eval((paste0(lLevels)))]                                            
    setDT(tempDF)[ , list(group_means = mean(p_B)), by=eval((paste0(lLevels)))]                                            
    setDT(tempDF)[ , list(group_means = mean(p_C)), by=eval((paste0(lLevels)))]                                            


Expected output:

    tempDF %>%
    group_by(d1) %>%
    summarise(p_A_mean = mean(p_A), p_B_mean = mean(p_B), p_C_mean = mean(p_C))

    # A tibble: 3 x 4
      d1    p_A_mean p_B_mean p_C_mean
      <chr>    <dbl>    <dbl>    <dbl>
    1 A            2      3          2
    2 B            3      4          1
    3 C            2      3.5        1

Solution

  • With data.table, pass the grouping columns to by and the columns to summarize to .SDcols:

    library(data.table)
    
    setDT(tempDF)
    
    tempDF[, lapply(.SD, mean),
           by = lLevels,
           .SDcols = lContinuum]
    
       d1 p_A p_B p_C
    1:  A   2 3.0   2
    2:  B   3 4.0   1
    3:  C   2 3.5   1
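
If the columns should be picked by pattern rather than listed by name, one option is to resolve the regex against names(tempDF) first and pass the resulting character vectors to by and .SDcols. The sketch below rebuilds the question's data so it runs on its own; the names group_cols, value_cols, res_dt and res_dt2 are made up for illustration, and the patterns() variant assumes a reasonably recent data.table (1.12.0 or later).

```r
library(data.table)

# Same data as in the question, rebuilt so the snippet is self-contained
tempDF <- data.table(
  d1  = c("A", "B", "C", "A", "C"),
  d2  = c(40L, 50L, 20L, 50L, 20L),
  d3  = c(20L, 40L, 50L, 40L, 50L),
  d4  = c(60L, 30L, 30L, 60L, 30L),
  p_A = c(1L, 3L, 2L, 3L, 2L),
  p_B = c(3L, 4L, 3L, 3L, 4L),
  p_C = c(2L, 1L, 1L, 2L, 1L),
  p4  = c(5L, 5L, 4L, 5L, 4L)
)

# Turn each regex into a character vector of matching column names
group_cols <- grep("^d1$", names(tempDF), value = TRUE)  # grouping column(s)
value_cols <- grep("^p_",  names(tempDF), value = TRUE)  # p_A, p_B, p_C (not p4)

# Group means for every matched column
res_dt <- tempDF[, lapply(.SD, mean), by = group_cols, .SDcols = value_cols]

# Alternative for the summarized columns only: let .SDcols match the regex itself
# (assumes data.table >= 1.12.0)
res_dt2 <- tempDF[, lapply(.SD, mean), by = group_cols, .SDcols = patterns("^p_")]

print(res_dt)
```

Resolving the names up front keeps the call identical to the one above, so the regex can be swapped each time without touching the summarizing code.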
    

    A similar approach in dplyr would be:

    library(dplyr)
    tempDF %>%
      group_by_at(lLevels) %>%
      summarize_at(lContinuum, mean)
    
    # A tibble: 3 x 4
      d1      p_A   p_B   p_C
      <chr> <dbl> <dbl> <dbl>
    1 A         2   3       2
    2 B         3   4       1
    3 C         2   3.5     1
    

    In either case, lLevels and lContinuum do not have to be typed out: they can be built from a regex match on the column names, e.g. with grep(pattern, names(tempDF), value = TRUE). The dplyr option also accepts select helpers such as starts_with() and ends_with():

    https://www.rdocumentation.org/packages/tidyselect/versions/0.2.5/topics/select_helpers
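
As a rough sketch of the regex route in dplyr, the matches() select helper can be used inside across(), which superseded the _at verbs in dplyr 1.0.0. The snippet assumes dplyr 1.0.0 or later and rebuilds the question's data so it runs on its own; res_tbl is a made-up name, and the .names argument is only there to get the p_A_mean style column names from the expected output.

```r
library(dplyr)

# Same data as in the question, rebuilt so the snippet is self-contained
tempDF <- data.frame(
  d1  = c("A", "B", "C", "A", "C"),
  d2  = c(40L, 50L, 20L, 50L, 20L),
  d3  = c(20L, 40L, 50L, 40L, 50L),
  d4  = c(60L, 30L, 30L, 60L, 30L),
  p_A = c(1L, 3L, 2L, 3L, 2L),
  p_B = c(3L, 4L, 3L, 3L, 4L),
  p_C = c(2L, 1L, 1L, 2L, 1L),
  p4  = c(5L, 5L, 4L, 5L, 4L)
)

res_tbl <- tempDF %>%
  # Group by every column whose name matches the regex (here only d1)
  group_by(across(matches("^d1$"))) %>%
  # Take the mean of every column starting with "p_" and suffix it with _mean
  summarise(across(matches("^p_"), mean, .names = "{.col}_mean"),
            .groups = "drop")

print(res_tbl)
```

Both regexes are plain strings, so they can be stored in variables and swapped for each run; starts_with("p_") would select the same summarized columns here.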