Search code examples
rdplyrtibble

Remove only the rows failed a test inside a group where a row passed the test


Background: I am using R to help me figure out which entries in file A (expected names) map to which entries in file B (possible names). Unfortunately, file B has multiple entries with the same ID and different names. Sometimes these names are exact matches of the ID in file A, but other times they are close or reworded. For now, I am finding those direct 1:1 matches, and removing the non-matches with the same ID.

Question: Is there a cleaner way to remove known non-matches inside of groups which contain 1:1 matches, while not removing non-matches in other groups?

Example data:

example <- tribble(
  ~id, ~expect, ~possible, ~status,
  1, "box of forks", "spoon drawer", "review",
  1, "box of forks", "box of forks", "match",
  1, "box of forks", "cheese knife", "review",
  2, "dish washer", "dish washing machine", "review",
  2, "dish washer", "oven", "review",
  2, "dish washer", "microwave", "review",
)

In this example, row 2 is a 1:1 match, so it was given the status "match". I, definitely a human, know that row 4 is also a match, because I can read and understand the data set. Since I expect there will be cases like this, I can't simply do filter(status == "match"). However, before I get into reading the file, I want to remove the rows that I now know are going to be junk. As such, I have figured out that I can do:

example %>% 
  group_by(id) %>% 
  mutate(matches = case_when(
    status == "review" ~ 0,
    status == "match" ~ 1,
  ),
  total = sum(matches)
  ) %>% 
  filter(
    !(matches == 0 & total > 0)
  )

Which gives the correct/expected result:

     id expect       possible             status matches total
  <dbl> <chr>        <chr>                <chr>    <dbl> <dbl>
1     1 box of forks box of forks         match        1     1
2     2 dish washer  dish washing machine review       0     0
3     2 dish washer  oven                 review       0     0
4     2 dish washer  microwave            review       0     0

While this works, it looks clunky and makes me sad. Is there a cleaner way of removing entries that are still "review" and exist inside of groups which contain at least one row with "match"?


Solution

  • Maybe just use all.

    example %>%
      filter(all(status=="review") | status=="match", .by=id)
    
    # A tibble: 4 × 4
         id expect       possible             status
      <dbl> <chr>        <chr>                <chr> 
    1     1 box of forks box of forks         match 
    2     2 dish washer  dish washing machine review
    3     2 dish washer  oven                 review
    4     2 dish washer  microwave            review