Search code examples
rdplyruniquedistinct-values

R: How do I choose which row dplyr::distinct() keeps based on a value in another variable?


The real life problem: I have subjects with MRI scan data. Some of which have been scanned multiple times (separate rows). Some of those were scanned under different protocols each time. I want to keep all unique rows by subject ID, and if a subject was scanned under two different protocols, I want it to prefer one over the other.

The toy example:

library(dplyr)  
df <- tibble(
        id = c("A", "A", "B", "C", "C", "D"), 
        protocol = c("X", "Y", "X", "X", "X", "Y"),
        date = c(seq(as.Date("2018-01-01"), as.Date("2018-01-06"), 
                 by="days")),
        var = 1:6)

I want to return a data frame with all unique subjects by id. When it comes to a duplicate value, instead of automatically keeping the first entry, I want it to keep the entry with "Y" as the protocol if it has that choice, but not to get rid of rows with "X" otherwise.

In the example, it would keep rows 2, 3, 4, and 6.

I prefer dplyr, but am open to other suggestions.

Nothing that I've tried even begins to work:

df %>% distinct(id, .keep_all = TRUE) #Nope! 

df %>% distinct(id, protocol == "Y", .keep_all = TRUE) #Nope!  

df$protocol <- factor(df$protocol, levels = c("Y", "X"))
df %>% distinct(id, .keep_all = TRUE) #Nope!  

df %>% group_by(id) %>% filter(protocol == "Y") #Nope!

Two good answers: @RobJensen suggests

df %>% arrange(id, desc(protocol == 'Y')) %>% distinct(id, .keep_all = TRUE)  

If I have multiple protocols and wish to assign an order to which they will be chosen, I can create a new variable where I assign the protocols an integer in order of preference, then use the suggestion from @joran

df %>% group_by(id) %>% arrange(desc(protocol),var) %>% slice(1)  

Thanks!


Solution

  • Arranging alphabetically works in the stated simple case, but if you want you can add a protocol_preference variable to give an ordering of what you'd prefer to be selected if Y isn't available, and to select "Y" even if it doesn't happen to be the last protocol value when sorted alphabetically.

    Building off @davechilders answer and @Nathan Werth 's idea of creating a factor based on an "order of importance" vector

    order_of_importance <- c("Y", "Z", "X")
    
        df2 %>%
          mutate(protocol = factor(protocol, order_of_importance)) %>%
          arrange(id, protocol) %>%
          distinct(id, .keep_all = TRUE)
    

    Or if you just want to select 'Y' and don't have a preference for what's selected if 'Y' isn't avaialable you can do

    df %>% 
        arrange(id, desc(protocol == 'Y')) %>% 
        distinct(id, .keep_all = TRUE)