I have a very similar problem to solve like this. However, I am not interested in sorting. I am interested in grouping (wrong word perhaps) the same string objects together and sum the value attached with string. Secondly, I would like to remove a string from the rows. I have prepared an example data frame. I have prepared this as close as the post I refered earlier.
branch <- c("OL", "CA", "PL", "OR", "FL")
perf <- c("Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ",
"Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)",
"Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)",
"Harry (1), Potter (32), Harry (2)",
"Timothy (3), HQ, Sara (44), HQ"
)
> performance <- data.frame(branch, perf)
> performance
branch perf
1 OL Mattheu (12), Jessica (32), Mattheu (22), Tom (10), HQ
2 CA Tobias (13), Kurt (22), Mathias (44), HQ, Tobias (55)
3 PL Tom (30), HQ, Giti (88), Patel (54), Tom (12), Tom (10)
4 OR Harry (1), Potter (32), Harry (2)
5 FL Timothy (3), HQ, Sara (44), HQ
In the first row, I have Mattheu two times. I want to have it once with the numbers summed up. That means, it should be Mattheu (34). Secondly, I would like to have the string HQ removed.
This is the output expectation of the second column:
[1] "Mattheu (34), Jessica (32), Tom (10)"
[2] "Tobias (68), Kurt (22), Mathias (44)"
[3] "Tom (52), Giti (88), Patel (54)"
[4] "Harry (3), Potter (32)"
[5] "Timothy (3), Sara (44)"
How to get the expected output?
Here's an option using dplyr
and tidyr
library.
library(dplyr)
library(tidyr)
performance %>%
separate_longer_delim(perf, ", ") %>%
filter(perf != "HQ") %>%
separate_wider_regex(perf,
c(name = "[A-Za-z]+", "\\s+\\(", score = "\\d+", "\\)")) %>%
type.convert(as.is = TRUE) %>%
summarise(score = sum(score), .by = c(branch, name)) %>%
summarise(perf = paste(name, '(', score, ')', collapse = ","), .by = branch)
# A tibble: 5 × 2
# branch perf
# <chr> <chr>
#1 OL Mattheu ( 34 ),Jessica ( 32 ),Tom ( 10 )
#2 CA Tobias ( 68 ),Kurt ( 22 ),Mathias ( 44 )
#3 PL Tom ( 52 ),Giti ( 88 ),Patel ( 54 )
#4 OR Harry ( 3 ),Potter ( 32 )
#5 FL Timothy ( 3 ),Sara ( 44 )
separate_longer_delim
based on ", "name
and score
). The regex used here is important to correctly identify the name
and score
values.sum
the values for each name
and branch
name
to get it in original format.