Search code examples
rdplyr

How can I retain only the most complete information rows per group in a dplyr pipe statement?


I have an input dataset, which has partially redundant information (i.e. info) per group (i.e. ID) in several rows. Some rows for some groups are more complete (i.e. contain additional _detail info), while some groups also never have the full information. I would like to retain the most complete information lines per ID group and information type and discard the others. Also, I would like to be able to do this in a single dplyr pipe without intermediate variables.

Input example:

data.frame(ID = c(1, 1, 1, 2, 3, 3, 3, 3),
           info = c("info1#info1_detail", "info1", "info3",
                    "info1",
                    "info1#info1_detail", "info1",
                    "info2#info2_detail", "info2"))

#>   ID               info
#> 1  1 info1#info1_detail
#> 2  1              info1
#> 3  1              info3
#> 4  2              info1
#> 5  3 info1#info1_detail
#> 6  3              info1
#> 7  3 info2#info2_detail
#> 8  3              info2

Desired output example:

data.frame(ID = c(1, 1, 2, 3, 3),
           info = c("info1#info1_detail",
                    "info3",
                    "info1",
                    "info1#info1_detail",
                    "info2#info2_detail"))

#>   ID               info
#> 1  1 info1#info1_detail
#> 2  1              info3
#> 3  2              info1
#> 4  3 info1#info1_detail
#> 5  3 info2#info2_detail

Solution

  • Assuming infoX and infoXdetail are always separated by "#" you could use regular expression's positive look ahead to group by ID and the part of info with the least information, e.g.:

    library(dplyr)
    
    x <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 3, 3),
                    info = c("info1#info1_detail", "info1", "info3",
                             "info1",
                             "info1#info1_detail", "info1",
                             "info2#info2_detail", "info2"))
    x %>% 
      arrange(ID, desc(info)) %>% 
      group_by(ID, stringr::str_extract(info, "(.*?)(?=\\#|$)")) %>% 
      summarize(info = first(info), .groups = "drop") %>%
      select(ID, info)