I have a dataset in which some duplicated samples have different dates of birth. This obviously should not be the case, so I am trying to come up with a way to flag those samples. In the end, the only samples with a 1 next to them would be duplicated samples that have different DOBs; duplicates with the same DOB and unique samples would get a 0. Here is a simplified version of the data.
test.df <- data.frame(
  specimen = c("A","A","B","C","B","D","C","D","E"),
  DOB = as.Date(c('2000-05-10','2002-04-13','2001-05-12','2003-06-01','2003-04-21',
                  '2000-10-20','2003-06-01','2000-10-20','2001-11-23')))
specimen DOB
1 A 2000-05-10
2 A 2002-04-13
3 B 2001-05-12
4 C 2003-06-01
5 B 2003-04-21
6 D 2000-10-20
7 C 2003-06-01
8 D 2000-10-20
9 E 2001-11-23
I would like something like this as the end result:
specimen DOB diff.dob
1 A 2000-05-10 1
2 A 2002-04-13 1
3 B 2001-05-12 1
4 C 2003-06-01 0
5 B 2003-04-21 1
6 D 2000-10-20 0
7 C 2003-06-01 0
8 D 2000-10-20 0
9 E 2001-11-23 0
Identifying duplicates is the easy part; I am just having trouble adding an extra column of 1's and 0's that indicates whether the duplicates have different DOBs. Any help would be greatly appreciated. Thank you.
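You can try ave, which applies a function within each specimen group and returns a vector the same length as the input: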
test.df$diff.dob <- with(test.df, ave(as.numeric(DOB), specimen,
FUN=function(x) length(unique(x))!=1))
Or using dplyr
library(dplyr)
test.df %>%
group_by(specimen) %>%
mutate(diff.dob=(n_distinct(DOB)!=1)+0)
# specimen DOB diff.dob
#1 A 2000-05-10 1
#2 A 2002-04-13 1
#3 B 2001-05-12 1
#4 C 2003-06-01 0
#5 B 2003-04-21 1
#6 D 2000-10-20 0
#7 C 2003-06-01 0
#8 D 2000-10-20 0
#9 E 2001-11-23 0
Or using data.table
library(data.table)
# uniqueN() counts the distinct DOBs per specimen, so groups of any size are handled
setDT(test.df)[, diff.dob := (uniqueN(DOB) > 1) + 0, by = specimen][]
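Or another possible option with base R (this one assumes a specimen appears at most twice)
# TRUE where the specimen/DOB combination occurs only once in the data
key <- with(test.df, paste(specimen, DOB))
indx1 <- !(duplicated(key) | duplicated(key, fromLast=TRUE))
# TRUE for specimens that occur more than once
tbl <- table(test.df$specimen)!=1
(test.df$specimen %in% names(tbl)[tbl] & indx1)+0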
#[1] 1 1 1 0 1 0 0 0 0
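If a specimen can appear more than twice, or if different specimens can share a DOB, a base R variant that counts the distinct DOBs per specimen may be safer. The following is only a sketch: the data frame df and its specimens "F" and "G" are made up for illustration and are not part of the original data.

```r
# Hypothetical data: specimen "F" has three rows, two of which share a DOB,
# and specimen "G" has the same DOB as one of the "A" rows.
df <- data.frame(
  specimen = c("A", "A", "F", "F", "F", "G"),
  DOB = as.Date(c("2000-05-10", "2002-04-13",
                  "2001-01-01", "2001-01-01", "2001-02-02",
                  "2000-05-10"))
)

# Number of distinct DOBs within each specimen (named by specimen)
n_dob <- sapply(split(df$DOB, df$specimen), function(x) length(unique(x)))

# Look up each row's specimen and flag it when its group has more than one DOB
df$diff.dob <- as.integer(n_dob[as.character(df$specimen)] > 1)
df$diff.dob
#[1] 1 1 1 1 1 0
```

All three "F" rows are flagged because the group contains two distinct DOBs, while "G" stays 0 even though its DOB also occurs under "A".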