Search code examples
rdplyrtidyverse

Sum row values if there are same strings in a column


I have a dataframe looking like

Sample Value Domain
S1 12 Domain_Identified_X13_A
S1 25 Domain_Identified_X28_B
S1 18 Domain_Unidentified

I want to aggregate values for the rows including string of "Identified" to have a final df

Sample Value Domain
S1 37 Domain_Identified
S1 18 Domain_Unidentified

Thank you


Solution

  • Using summarize from dplyr and sub to get the groups.

    library(dplyr)
    
    df %>% 
      group_by(Sample, Domain = sub("(.*dentified).*", "\\1", Domain)) %>% 
      summarize(Value = sum(Value), .groups="drop")
    # A tibble: 2 × 3
      Sample Domain              Value
      <chr>  <chr>               <int>
    1 S1     Domain_Identified      37
    2 S1     Domain_Unidentified    18
    

    Alternative a base R approach using aggregate

    setNames(aggregate(df$Value, by=
               list(df$Sample, sub("(.*dentified).*", "\\1", df$Domain)), \(x) 
                 sum(x)), c(colnames(df)[c(1,3)], "Value"))
      Sample              Domain Value
    1     S1   Domain_Identified    37
    2     S1 Domain_Unidentified    18