Search code examples
rtidyversesummarize

Summing rows with the same value across multiple distinct columns


I have a data table structured like this:

Molecular Formula AHBG13 AHBG15 AHBG16  AHBG19 AHBG20
C23H48NO7P  1767    4570    7052    8831    7921
C24H50NO7P  0   0   0   73331   0
C24H50NO7P  45335   48281   0   0   77333
C24H50NO7P  0   0   48537   0   0
C25H44NO7P  4190    3166    0   4903    6609
C26H48NO7P  5110    3102    0   8939    7772
C26H50NO7P  0   3021    1981    7770    5357
C26H52NO7P  6377    0   0   0   13176
C26H52NO7P  0   8245    7883    20806   0
C26H54NO7P  12049   27408   35437   45506   44353
C26H54NO7P  0   0   0   2123    2367
C26H54NO7P  0   1307    0   0   0
C28H50NO7P  0   6503    0   21058   10462
C28H50NO7P  7547    0   0   0   0
C28H50NO7P  6414    0   4990    0   0
C28H54NO7P  0   0   863 0   6411
C28H54NO7P  2107    4732    0   6778    0
C30H54NO7P  0   0   2185    8351    5841
C30H54NO7P  3809    0   0   0   0
C30H54NO7P  0   3886    0   0   0

where the first column is the molecular formula for a lipid, and the following columns are the concetrations of each lipid for my specific samples. However, there are repeats of certain lipids due to isomers of the same molecule being present, requiring me to sum these values using r. Is there a way to sum the concentrations of identical lipids across all samples? I'd like to create something like this:

Molecular Formula AHBG13 AHBG15 AHBG16 AHBG19 AHBG20
C23H48NO7P  1767    4570    7052    8831    7921
C24H50NO7P  45335   48281   48537   73331   77333
C25H44NO7P  4190    3166    0   4903    6609
C26H48NO7P  5110    3102    0   8939    7772
C26H50NO7P  0   3021    1981    7770    5357
C26H52NO7P  6377    8245    7883    20806   13176
C26H54NO7P  12049   28715   35437   47629   46720

Optimally something that can be added into a pipeline, but I'll take whatever I can get.

I tried using a various variations of summarize() and summarize_all(), but wasn't able to find a syntax that produces the desired result. I could use the summarize function to do what I want for each sample individually, with the amount of samples and data that I have it would save a significant amount of time if I could apply the function to multiple samples at once.

The closest I got was using

summarise(across(c("AHBG13","AHBG15","AHBG16,"AHBG19","AHBG20"), sum))

which resulted in the sum of all of the concentrations, like this:

                    AHBG13  AHBG15  AHBG16  AHBG19  AHBG20
Molecular Formula   94705   114221  108928  208396  187602

Solution

  • I think you should use .by = ..., e.g.,

    df %>%
      summarise(across(everything(), sum), .by = MolecularFormula)
    

    or simply applies aggregate

    aggregate(. ~ MolecularFormula, df, sum)
    

    which gives

       MolecularFormula AHBG13 AHBG15 AHBG16 AHBG19 AHBG20
    1        C23H48NO7P   1767   4570   7052   8831   7921
    2        C24H50NO7P  45335  48281  48537  73331  77333
    3        C25H44NO7P   4190   3166      0   4903   6609
    4        C26H48NO7P   5110   3102      0   8939   7772
    5        C26H50NO7P      0   3021   1981   7770   5357
    6        C26H52NO7P   6377   8245   7883  20806  13176
    7        C26H54NO7P  12049  28715  35437  47629  46720
    8        C28H50NO7P  13961   6503   4990  21058  10462
    9        C28H54NO7P   2107   4732    863   6778   6411
    10       C30H54NO7P   3809   3886   2185   8351   5841