I have an example dataset:
example <- data.frame(
date = c("6/1/22", "6/2/22", "6/3/22",
"6/1/22", "6/2/22", "6/2/22", "6/3/22",
"6/2/22", "6/2/22", "6/2/22", "6/3/22", "6/4/22"),
sub = c(1101, 1101, 1101,
1102, 1102, 1102, 1102,
1103, 1103, 1103, 1103, 1103),
text = c("a", "b", "c",
"d","e", "f", "g",
"h", "i", "j", "k", "l"))
There are some sub
s that have repeated entries for some dates (e.g. 1102, 1103). I want to keep ONLY the bottom most row for each sub if there are duplicated entries in the date
column. These are the two example outputs I want from this dataframe.
Output 1: A dataframe where there are unique dates for each sub
output1 <- data.frame(
date = c("6/1/22", "6/2/22", "6/3/22",
"6/1/22", "6/2/22", "6/3/22",
"6/2/22", "6/3/22", "6/4/22"),
sub = c(1101, 1101, 1101,
1102, 1102, 1102,
1103, 1103, 1103),
text = c("a", "b", "c",
"d","f", "g",
"j", "k", "l")
)
Output 2: A dataframe with ALL entries of the sub
s and date
s where there are multiple copies.
output2 <- data.frame(
date = c("6/2/22", "6/2/22",
"6/2/22", "6/2/22", "6/2/22"),
sub = c(1102, 1102,
1103, 1103, 1103),
text = c("e", "f",
"h", "i", "j")
)
I have seen solutions for this using distinct()
, but that usually only keeps the first row with the duplicated value. I would like the latest value (e.g. the bottom most row of the duplicated entry). Does anyone know how to do this? Thank you so much!
With dplyr
, you can use slice_tail()
by sub
and date
:
example %>%
slice_tail(by = c(sub, date))
# date sub text
# 1 6/1/22 1101 a
# 2 6/2/22 1101 b
# 3 6/3/22 1101 c
# 4 6/1/22 1102 d
# 5 6/2/22 1102 f
# 6 6/3/22 1102 g
# 7 6/2/22 1103 j
# 8 6/3/22 1103 k
# 9 6/4/22 1103 l
Use filter(n() > 1)
by sub
and date
:
example %>%
filter(n() > 1, .by = c(sub, date))
# date sub text
# 1 6/2/22 1102 e
# 2 6/2/22 1102 f
# 3 6/2/22 1103 h
# 4 6/2/22 1103 i
# 5 6/2/22 1103 j