Sum values based on their string ID


I have a data frame that consists of comma-separated sequences of strings. For example:

df <- data.frame(patterns = c("CCDC127, HSF1, NDUFB9", "CCDC127, EXOC3, YIF1A", "EXOC3, NDUFB9, YIF1A"))
df
               patterns
1 CCDC127, HSF1, NDUFB9
2 CCDC127, EXOC3, YIF1A
3  EXOC3, NDUFB9, YIF1A

I have another data frame in which each string corresponds to a numerical value. For example:

df2 <- data.frame(strings = c("CCDC127", "HSF1", "NDUFB9", "EXOC3", "YIF1A"),
                   scores = c(10, 11, 12, 13, 14))
df2
  strings scores
1 CCDC127     10
2    HSF1     11
3  NDUFB9     12
4   EXOC3     13
5   YIF1A     14

I would like to calculate the sum for each pattern in the first data frame, based on the values in the second data frame. For example:

               patterns sum
1 CCDC127, HSF1, NDUFB9  33
2 CCDC127, EXOC3, YIF1A  37
3  EXOC3, NDUFB9, YIF1A  39

I would appreciate any directions and help with this question.

Thank you! Olha


Solution

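  • You can split each pattern on ", " with strsplit, use match to find each ID's row in df2$strings, and sum the corresponding scores for every pattern with sapply: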

    df$sum <- sapply(strsplit(df$patterns, ", "), 
                     function(x) sum(df2$scores[match(x, df2$strings)]))
    df
    #>                patterns sum
    #> 1 CCDC127, HSF1, NDUFB9  33
    #> 2 CCDC127, EXOC3, YIF1A  37
    #> 3  EXOC3, NDUFB9, YIF1A  39
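
If the real data is messier than the example (inconsistent spacing after commas, or IDs that have no entry in df2), a slightly more defensive variant may help. The sketch below is only one possible approach: the names score_by_id and sum_pattern are hypothetical helpers introduced here, not part of the original answer. It assumes the columns are character vectors (the default in R 4.0 and later; stringsAsFactors = FALSE is set explicitly so it also holds on older versions).

```r
# Example data from the question, with strings kept as character vectors
df <- data.frame(patterns = c("CCDC127, HSF1, NDUFB9",
                              "CCDC127, EXOC3, YIF1A",
                              "EXOC3, NDUFB9, YIF1A"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(strings = c("CCDC127", "HSF1", "NDUFB9", "EXOC3", "YIF1A"),
                  scores = c(10, 11, 12, 13, 14),
                  stringsAsFactors = FALSE)

# Named lookup vector: names are the IDs, values are the scores
# (score_by_id is a hypothetical name)
score_by_id <- setNames(df2$scores, df2$strings)

# Hypothetical helper: sum the scores of one comma-separated pattern
sum_pattern <- function(pattern, lookup, na.rm = FALSE) {
  # Split on commas, then trim surrounding whitespace from each ID
  ids <- trimws(strsplit(pattern, ",", fixed = TRUE)[[1]])
  # IDs missing from the lookup yield NA; na.rm controls whether they are skipped
  sum(lookup[ids], na.rm = na.rm)
}

# vapply guarantees exactly one numeric value per pattern
df$sum <- vapply(df$patterns, sum_pattern, numeric(1),
                 lookup = score_by_id, USE.NAMES = FALSE)
df
```

With the default na.rm = FALSE, a pattern containing an unknown ID gets NA as its sum, which makes missing scores visible; passing na.rm = TRUE sums only the IDs that were found.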