Search code examples
rgroup-byconcatenationdplyr

Concatenating words without quotes in R for groupby in dplyr


I have datasets that involve large number of column joins (8-12) and at the same time depending upon the circumstance 1-3 of these columns may not be needed.

Presently I have been writing out these long group-bys using dplyr but with so many columns and changing situations, it is easy to misspell or forget a column.

I'd like to somehow create a variable that goes along with doing this, but I haven't been able to figure out how to due to the quotes that are present when I try to use paste. Can anyone show me a quick example of how to do this?

For example:

  library(dplyr)

    # I want this group-list not to have quotes so I can drop in my group_by below
   my_group_list = paste0("vs"," ","am") #quotes get in the way

   mtcars %>% group_by(my_group_list) %>% summarise(countofvalues = n())

Solution

  • If there are many columns, we can specify the columns to group from directly subsetting the column names. In that case, use group_by_

    library(dplyr)
    mtcars %>%
         group_by_(.dots=names(.)[8:9]) %>% 
         summarise(countofvalues = n())
    #     vs    am countofvalues
    #   (dbl) (dbl)         (int)
    #1     0     0            12
    #2     0     1             6
    #3     1     0             7
    #4     1     1             7
    

    The above also works if we have a vector of values

    my_group_list <- c("vs", "am")
    mtcars %>%
          group_by_(.dots = my_group_list) %>%
          summarise(countofvalues = n())
    #    vs    am countofvalues
    #  (dbl) (dbl)         (int)
    #1     0     0            12
    #2     0     1             6
    #3     1     0             7
    #4     1     1             7
    

    As the OP mentioned that it is not doing the grouping, we can test it by uniteing the 'vs' and 'am' columns, use it as grouping variable and then do the n().

    library(tidyr)
    mtcars %>%
          unite(vs_am, vs, am) %>%
          group_by(vs_am) %>% 
          summarise(countofvalues = n())
    #  vs_am countofvalues
    #  (chr)         (int)
    #1   0_0            12
    #2   0_1             6
    #3   1_0             7
    #4   1_1             7