Search code examples
runiquecase-when

how to use case_when over unique value first to create a new variable


I have a data frame (x). As you could see, I have different values (bp: yes/no and N) for the same ID, but sometimes, I have a unique ID with always bp ==no.

x
      ID        bp    N
      8012199   no    75
      8012199   yes   2
      8012211   no    118
      8012211   yes   2
      8012321   no    37
      8012341   no    146
      8012342   no    84
  

I would like to create a new variable (frequency: low and high)

x
          ID        bp    N    frequency
          8012199   no    75   
          8012199   yes   2    low
          8012211   no    118  
          8012211   yes   4    high    
          8012321   no    37   low
          8012341   no    146  low 
          8012342   no    84   low

based on different conditions:

  1. if bp == yes | n < 2 :low
  2. if bp == yes | n > 2 :high
  3. if ID is unique (always = no): low

What I've tried so far is

r <-  x %>% 
      mutate(bp= case_when(
      duplicated(ID)==FALSE  ~ "low",
      bp %in% "yes" & n <=2 ~ "low",
      bp %in% "yes" & n > 2 ~ "high" ))

duplicated(x$ID)
FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE 

It works effectively for the "yes" condition, but not for the duplication. Any advice would be fantastic!


Solution

  • Your two examples are slightly different - N in row 4 has changed.

    I've used the second example data:

    x <- tibble::tribble(
            ~ID,        ~bp,    ~N,
          "8012199",   "no",    75,
          "8012199",   "yes",   2,
          "8012211",   "no",    118,
          "8012211",   "yes",   4,
          "8012321",   "no",    37,
          "8012341",   "no",    146,
          "8012342",   "no",    84
    )
    

    Here's a solution using the tidyverse:

    library(tidyverse)
    x %>% 
      group_by(ID) %>% 
      mutate(
        id_count = n(),
        frequency = case_when(
        bp == "yes" & N <=2 ~ "low",
        bp == "yes" & N > 2 ~ "high",
        bp == "no" & id_count == 1 ~ "low",
        TRUE ~ "")
      ) %>% 
      ungroup() %>% 
      select(-id_count)
    

    Basically, the id_count variable calculates the number of rows for each ID number. This is then used to generate the low frequency results.

    Also, notice that I use N <= 2 to represent "low" frequency. I've based this on the example you've included.