Search code examples
rstringsummatchtidy

How to perform a R untidy sum?


Update

I have untidy data that looks like the example df2.

df2 = data.frame(label = c("C360", "C360~C183", "C157~C360", "C183", "C157", "C157~C183", "C195~C183"), 
                 values = c(10, 11, 12, 13, 14, 15, 16))
> df2
      label values
1      C360     10
2 C360~C183     11
3 C157~C360     12
4      C183     13
5      C157     14
6 C157~C183     15
7 C195~C183     16

I need to sum partial matches that include the letters in different positions, so that I will the following desired output:

  string sum
1   C360  33
2   C183  55
3   C157  41
4   C195  16

Old

df1 = data.frame(label = c("a", "a.1", "2.a", "b", "c"), 
                 values = c(10, 11, 12, 13, 14))
> df1
  label values
1     a     10
2   a.1     11
3   2.a     12
4     b     13
5     c     14

Expected output:

   label sum
1      a  33
2      b  13
3      c  14

Solution

  • New

    stack(setNames(strsplit(df2$label, "~"), df2$values)) |>
      type.convert(as.is = TRUE) |>
      `colnames<-`(names(df2)) |>
      aggregate(values ~ label, sum)
    
      label values
    1  C157     41
    2  C183     55
    3  C195     16
    4  C360     33
    

    Data:

    df2 = data.frame(label = c("C360", "C360~C183", "C157~C360", "C183", "C157", "C157~C183", "C195~C183"), 
                     values = c(10, 11, 12, 13, 14, 15, 16))
    

    Old

    For df1, you might consider

    aggregate(df1$values, list(letter = gsub("[^a-z]", "", df1$label)), sum)
    
      letter  x
    1      a 33
    2      b 13
    3      c 14
    

    where the appropriate regex (something more concrete than "[^a-z]") depends on your letter (labels) column.