Search code examples
rdataframedplyrdistinct

Saving only the last duplicate by group in R


I have an example dataset:

example <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/2/22", "6/3/22",
           "6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102, 1102,
          1103, 1103, 1103, 1103, 1103),
  text = c("a", "b", "c",
           "d","e", "f", "g",
           "h", "i", "j", "k", "l"))

There are some subs that have repeated entries for some dates (e.g. 1102, 1103). I want to keep ONLY the bottom most row for each sub if there are duplicated entries in the date column. These are the two example outputs I want from this dataframe.

Output 1: A dataframe where there are unique dates for each sub

output1 <- data.frame(
  date = c("6/1/22", "6/2/22", "6/3/22",
           "6/1/22", "6/2/22", "6/3/22",
           "6/2/22", "6/3/22", "6/4/22"),
  sub = c(1101, 1101, 1101,
          1102, 1102, 1102,
          1103, 1103, 1103),
  text = c("a", "b", "c",
           "d","f", "g",
           "j", "k", "l")
)

Output 2: A dataframe with ALL entries of the subs and dates where there are multiple copies.

output2 <- data.frame(
  date = c("6/2/22", "6/2/22",
           "6/2/22", "6/2/22", "6/2/22"),
  sub = c(1102, 1102,
          1103, 1103, 1103),
  text = c("e", "f",
           "h", "i", "j")
)

I have seen solutions for this using distinct(), but that usually only keeps the first row with the duplicated value. I would like the latest value (e.g. the bottom most row of the duplicated entry). Does anyone know how to do this? Thank you so much!


Solution

  • Task 1

    With dplyr, you can use slice_tail() by sub and date:

    example %>%
      slice_tail(by = c(sub, date))
    
    #     date  sub text
    # 1 6/1/22 1101    a
    # 2 6/2/22 1101    b
    # 3 6/3/22 1101    c
    # 4 6/1/22 1102    d
    # 5 6/2/22 1102    f
    # 6 6/3/22 1102    g
    # 7 6/2/22 1103    j
    # 8 6/3/22 1103    k
    # 9 6/4/22 1103    l
    

    Task 2

    Use filter(n() > 1) by sub and date:

    example %>%
      filter(n() > 1, .by = c(sub, date))
    
    #     date  sub text
    # 1 6/2/22 1102    e
    # 2 6/2/22 1102    f
    # 3 6/2/22 1103    h
    # 4 6/2/22 1103    i
    # 5 6/2/22 1103    j