Search code examples
rdataframedplyrtidyverse

group_by and filter removes too many rows


Below is my example, let me explain what I am trying to do, although it is not working as I want it to.

I need to find all instances where there are 2+ unique values in column z on the same date for the same person. However, I need to find where a specific list of values in column z are present.

library(tidyverse)

x <- c("Person A","Person A","Person A","Person A","Person A","Person A")
y <- c("2022-01-01","2022-01-01","2022-01-20","2022-02-01","2022-02-01","2022-02-01")
z <- c("A","D","A","A","C","B")

df <- data.frame(x,y,z)
df

df %>% 
  group_by(x,y) %>% 
  mutate(unique_z = n_distinct(z)) %>% 
  # ungroup() %>% 
  filter(unique_z > 1,
         z %in% c("C","B"))

Below is an image of what I want the output to be, but I cannot figure it out.

Row 1 and 2 should be removed because even though it is 2 unique values of z on the same date for the same person, it does not include "C" or "B".

Row 3 is removed because it is only one unique value for that person and date.

Rows 4, 5, and 6 all should stay because that person, date combination has three unique values of z. Also, "C" and/or "B" occur in these rows. For some reason, row 4 is being removed every time. I to see the other values of z on that person, date combination. I thought grouping and filtering would do this, but it does not seem to the way I am doing it.


Solution

  • You need to use any to check for the presence of c("B", "C") within each group and not at each row; see below:

    library(dplyr)
    
    df %>% 
      group_by(x,y) %>% 
      mutate(unique_z = n_distinct(z)) %>% 
      filter(unique_z > 1,
             any(z %in% c("B","C")))
    ## any(z %in% c("C")) & any(z %in% c("B"))) 
    ## use this one instead if you want B and C present at the same time ...
    ## ... and two B's or two C's are not desired
    
    # # A tibble: 3 x 4
    # # Groups:   x, y [1]
    #   x        y          z     unique_z
    #   <fct>    <fct>      <fct>    <int>
    # 1 Person A 2022-02-01 A            3
    # 2 Person A 2022-02-01 C            3
    # 3 Person A 2022-02-01 B            3