Search code examples
rdplyrsummarize

In grouped dataframe, summarize rows which contain certain value (e.g. zeros only) in set of columns with common substring in header [R]


Given the following dataframe:

dframe <- structure(list(id = c("294361-7349174-75411122", "294365-7645230-95464222", 
"291915-7345264-75464222", "291365-7345074-75164202", "594165-7345274-78444212", 
"234385-7335274-75464229", "734515-1345274-95464892", "201365-8345274-78464232", 
"294365-7315971-75464120", "591365-7345374-75464222", "394365-7345204-75411022", 
"494305-7345273-75464222", "291161-7345271-75461210", "294035-7345201-75464292", 
"298365-7345279-78864223", "294365-7345274-15964293", "294395-7345274-69464299", 
"899965-1345294-95464222", "194365-7145274-75464222", "194361-7349231-75464222", 
"294365-7345274-75464122", "191315-1345274-13464322", "794365-7349274-75464292", 
"214365-8318274-75464222", "394363-8341274-39494929"), gene = structure(c(3L, 
3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("ABC_1", "C_1", "XYZ_123"
), class = "factor"), group = structure(c(2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("KO", "WT"), class = "factor"), class_A = c(0, 
1, 0, 2, 1, 0, 0, 1, 0, 1, 0, 0, 0, 2, 2, 1, 0, 0, 0, 0, 1, 1, 
1, 0, 3), class_B = c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1)), row.names = c(NA, -25L), class = "data.frame")

I would like to produce a new dataframe for data grouped per "group" and "gene". I want to output sum of rows per group in which both class_A and class_B columns contain the same, desired value - now I am interested in zeros.

Based on the answers provided in this thread: Efficient way to create a dataframe with multiple summary columns based on a grouped dataframe using dplyr in R

I can achieve this with following code:

desired_dframe <- dframe %>% 
  group_by(group, gene) %>%
  summarise(counts_zero = sum(ifelse((class_A == 0 & class_B == 0), 1, 0))) 

However, the above approach has one pitfall: the column names are hardcoded. In real life, I have dataframes with various number of columns denoting classes (and other names, e.g. "class_C", "class_Z" etc.). The common part of their names, is "class_". Based on this, I would like to consider all of the columns of interest.

I was playing around with rowSums(dplyr::across(dplyr::starts_with('class_')==0)) to achieve this, yet with no avail. The function throws the error and I have no idea how to debug it.

Also, I was trying to incorporate this column into the @akrun's answer provided here: Efficient way to create a dataframe with multiple summary columns based on a grouped dataframe using dplyr in R

On the @akrun's request, I am putting this into the new thread.


Solution

  • If it is to get the sum of class_ columns, use across or if_all (more correct) i.e. loop over the class_ columns in if_all, apply the condition .x ==0, which returns TRUE only if all the columns looped for that rows will be 0 or else it return FALSE. Do the sum directly on the logical vector (TRUE -> 1 and FALSE -> 0)

    library(dplyr)
    dframe %>% 
      group_by(group, gene) %>% 
      summarise(counts_zero = sum(if_all(starts_with('class_'),
         ~ .x == 0)), .groups = 'drop')
    

    -output

    # A tibble: 5 × 3
      group gene    counts_zero
      <fct> <fct>         <int>
    1 KO    ABC_1             2
    2 KO    XYZ_123           4
    3 WT    ABC_1             2
    4 WT    C_1               0
    5 WT    XYZ_123           1