I have an input dataset, which has partially redundant information (i.e. info
) per group (i.e. ID
) in several rows. Some rows for some groups are more complete (i.e. contain additional _detail
info), while some groups also never have the full information. I would like to retain the most complete information lines per ID
group and information type and discard the others. Also, I would like to be able to do this in a single dplyr
pipe without intermediate variables.
Input example:
data.frame(ID = c(1, 1, 1, 2, 3, 3, 3, 3),
info = c("info1#info1_detail", "info1", "info3",
"info1",
"info1#info1_detail", "info1",
"info2#info2_detail", "info2"))
#> ID info
#> 1 1 info1#info1_detail
#> 2 1 info1
#> 3 1 info3
#> 4 2 info1
#> 5 3 info1#info1_detail
#> 6 3 info1
#> 7 3 info2#info2_detail
#> 8 3 info2
Desired output example:
data.frame(ID = c(1, 1, 2, 3, 3),
info = c("info1#info1_detail",
"info3",
"info1",
"info1#info1_detail",
"info2#info2_detail"))
#> ID info
#> 1 1 info1#info1_detail
#> 2 1 info3
#> 3 2 info1
#> 4 3 info1#info1_detail
#> 5 3 info2#info2_detail
Assuming infoX and infoXdetail are always separated by "#" you could use regular expression's positive look ahead to group by ID and the part of info with the least information, e.g.:
library(dplyr)
x <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 3, 3),
info = c("info1#info1_detail", "info1", "info3",
"info1",
"info1#info1_detail", "info1",
"info2#info2_detail", "info2"))
x %>%
arrange(ID, desc(info)) %>%
group_by(ID, stringr::str_extract(info, "(.*?)(?=\\#|$)")) %>%
summarize(info = first(info), .groups = "drop") %>%
select(ID, info)