Search code examples
rdplyrtidyversesummarize

How to summarize multiple columns at once?


I am trying to create a table with the Sample ID, raw counts, and Gene names.

In this table, a new row is created for each Sample Id to correspond to each gene name:

Sample ID Gene A Gene B
Sample 1 1 -
Sample 1 - 2
Sample 2 3 -
Sample 2 - 4

Rather than having numerous rows, I would like to condense them into a single row:

Sample ID Gene A Gene B
Sample 1 1 2
Sample 2 3 4

This is the code I have so far:

dfwide = data.wide.df %>% group_by(SampleId) %>%
summarise(Sample 1 = sum(Sample 1, na.rm = T),
Sample 2 = sum(Sample 2, na.rm = T))

I have over 1000 samples, so I was hoping to find a way to summarise all the genes at once. Any help would be appreciated!


Solution

  • If you are always guaranteed to have the same number of Gene A as Gene B, then this might work:

    library(dplyr)
    dat %>%
      group_by(Sample.ID) %>%
      summarize(across(starts_with("Gene"), ~ .[. != "-"]))
    # # A tibble: 2 x 3
    #   Sample.ID Gene.A Gene.B
    #   <chr>     <chr>  <chr> 
    # 1 Sample 1  1      2     
    # 2 Sample 2  3      4     
    

    I've assumed that you have literal "-" strings; if they are NA or empty "", then that conditional can be modified to account for that.

    The risk here is if there are uneven numbers of genes. For example, if the data were instead

    dat2
    #   Sample.ID Gene.A Gene.B
    # 1  Sample 1      1      -
    # 2  Sample 1      -      2
    # 5  Sample 1      -      3
    # 3  Sample 2      3      -
    # 4  Sample 2      -      4
    
    dat2 %>%
      group_by(Sample.ID) %>%
      summarize(across(starts_with("Gene"), ~ .[. != "-"]))
    # # A tibble: 3 x 3
    # # Groups:   Sample.ID [2]
    #   Sample.ID Gene.A Gene.B
    #   <chr>     <chr>  <chr> 
    # 1 Sample 1  1      2     
    # 2 Sample 1  1      3     
    # 3 Sample 2  3      4     
    

    You see how the 1 is repeated in multiple rows; this worked this time without error because of R's "recycling": since the number of valid strings in Gene.B is a perfect multiple of the number of valid strings in Gene.A, there is no complaint and the values are repeated. I consider recycling here to likely be improper, so perhaps not what you need.

    If this is the case, it might be more appropriate to store this in "long" format:

    dat %>%
      tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
      filter(Value != "-")
    # # A tibble: 4 x 3
    #   Sample.ID Gene  Value
    #   <chr>     <chr> <chr>
    # 1 Sample 1  A     1    
    # 2 Sample 1  B     2    
    # 3 Sample 2  A     3    
    # 4 Sample 2  B     4    
    dat2 %>%
      tidyr::pivot_longer(-Sample.ID, names_to = "Gene", names_pattern = "Gene\\.(.*)", values_to = "Value") %>%
      filter(Value != "-")
    # # A tibble: 5 x 3
    #   Sample.ID Gene  Value
    #   <chr>     <chr> <chr>
    # 1 Sample 1  A     1    
    # 2 Sample 1  B     2    
    # 3 Sample 1  B     3    
    # 4 Sample 2  A     3    
    # 5 Sample 2  B     4    
    

    which will likely require you to refactor downstream processing, but at least it is safe.


    Data:

    dat <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "3", "-"), Gene.B = c("-", "2", "-", "4")), class = "data.frame", row.names = c(NA, -4L))
    dat2 <- structure(list(Sample.ID = c("Sample 1", "Sample 1", "Sample 1", "Sample 2", "Sample 2"), Gene.A = c("1", "-", "-", "3", "-"), Gene.B = c("-", "2", "3", "-", "4")), row.names = c(1L, 2L, 5L, 3L, 4L), class = "data.frame")