How to identify and remove duplicates with multiple condition in R

I would like to identify and remove duplicates from my dataset. The issue is that I have two different criteria. First, the main criteria that I am using is the most recent time. However, not all duplicates have dates. On the cases dates are missing, I will have to check one by one to see which row I am going to keep. So, this is what I need:

Suppose I have the following data:

df <- data.frame(id = c("a", "a", "b","b", "c","c"),
                 date = c("2023-05-16", "2023-08-21", "2023-06-05", NA, "2023-05-07", "2023-05-20"))


df$date <- ymd(df$date)

[1] "Date"


  id       date
1  a 2023-05-16
2  a 2023-08-21
3  b 2023-06-05
4  b       <NA>
5  c 2023-05-07
6  c 2023-05-20

First, I need to identify all duplicates that contain NA to compare the rows and see which row I will keep. I need a resulting dataframe similar to this, it contains the duplicate and the unique value:

  id       date
1  b 2023-06-05
2  b       <NA>

Notice that I have both rows so I can compare them. I am keeping the examples simple, but I am not sure on my actual data frame how many duplicates per id I have. Next, I will be left with the remaining dataframe:

  id       date
1  a 2023-05-16
2  a 2023-08-21
3  c 2023-05-07
4  c 2023-05-20

I need to keep only the most recent of the values, and I will end with something like this:

  id       date
1  a 2023-08-21
2  c 2023-05-20

Let me know if something isn't clear and you need more clarification.


  • Part 1: Identify all duplicates that contain NA and keep them in a separate df for review:

    #Ids of NA dates:
    ids_na <- df |> filter(is.na(date)) |> pull(id)
    #Dfs of duplicates with NA:
    is_na <- df |> filter(id %in% ids_na)
    # id       date
    # b 2023-06-05
    # b       <NA>

    Part 2: keep most recent entry of duplicated values

    # Remove dupes with NAs from df
    clean_df <- df |> filter(!(id %in% ids_na))
    # Get the most recent by grouping 
    unique_entries <- clean_df |> group_by(id) |>  summarise(date = max(date))
    # A tibble: 2 × 2
    # id    date      
    # <chr> <date>    
    #   a     2023-08-21
    #   c     2023-05-20