Search code examples
rdplyrdata.tablestringr

Grouping string together and sum it together


I have a very similar problem to solve like this. However, I am not interested in sorting. I am interested in grouping (wrong word perhaps) the same string objects together and sum the value attached with string. Secondly, I would like to remove a string from the rows. I have prepared an example data frame. I have prepared this as close as the post I refered earlier.

branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ", 
          "Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
          "Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
          "Harry (1), Potter (32), Harry (2)",
          "Timothy (3), HQ, Sara (44), HQ"
          )
> performance <- data.frame(branch, perf)
> performance
  branch                                                    perf
1     OL  Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ
2     CA   Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)
3     PL Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)
4     OR                       Harry (1), Potter (32), Harry (2)
5     FL                          Timothy (3), HQ, Sara (44), HQ

In the first row, I have Mattheu two times. I want to have it once with the numbers summed up. That means, it should be Mattheu (34). Secondly, I would like to have the string HQ removed.

This is the output expectation of the second column:

[1] "Mattheu (34), Jessica (32), Tom (10)"
[2] "Tobias (68), Kurt (22), Mathias (44)"
[3] "Tom (52), Giti (88), Patel (54)"     
[4] "Harry (3), Potter (32)"              
[5] "Timothy (3), Sara (44)"

How to get the expected output?


Solution

  • Here's an option using dplyr and tidyr library.

    library(dplyr)
    library(tidyr)
    
    performance %>%
      separate_longer_delim(perf, ", ") %>%
      filter(perf != "HQ") %>%
      separate_wider_regex(perf, 
                          c(name = "[A-Za-z]+", "\\s+\\(", score = "\\d+", "\\)")) %>%
      type.convert(as.is = TRUE) %>%
      summarise(score = sum(score), .by = c(branch, name)) %>%
      summarise(perf = paste(name, '(', score, ')', collapse = ","), .by = branch)
    
    # A tibble: 5 × 2
    #  branch perf                                    
    #  <chr>  <chr>                                   
    #1 OL     Mattheu ( 34 ),Jessica ( 32 ),Tom ( 10 )
    #2 CA     Tobias ( 68 ),Kurt ( 22 ),Mathias ( 44 )
    #3 PL     Tom ( 52 ),Giti ( 88 ),Patel ( 54 )     
    #4 OR     Harry ( 3 ),Potter ( 32 )               
    #5 FL     Timothy ( 3 ),Sara ( 44 ) 
    
    1. Split the data in separate rows using separate_longer_delim based on ", "
    2. remove the "HQ" rows
    3. separate the name and number in two different columns (name and score). The regex used here is important to correctly identify the name and score values.
    4. sum the values for each name and branch
    5. combine the rows for each name to get it in original format.