
R: Flagging Sample from Same Specimen w/ Different DOB


I have a dataset in which some specimens were sampled more than once, and some of those duplicate samples have different dates of birth (DOB). This should not happen, so I am trying to find a way to flag those particular samples. In the end, the only samples with a 1 next to them would be duplicated samples that have differing DOBs; duplicates with the same DOB and unique samples would get a 0. Here is a simplified version of the data:

test.df <- data.frame(specimen = c("A","A","B","C","B","D","C","D","E"),
                      DOB = as.Date(c('2000-05-10','2002-04-13','2001-05-12',
                                      '2003-06-01','2003-04-21','2000-10-20',
                                      '2003-06-01','2000-10-20','2001-11-23')))

  specimen        DOB
1    A        2000-05-10
2    A        2002-04-13
3    B        2001-05-12
4    C        2003-06-01 
5    B        2003-04-21 
6    D        2000-10-20
7    C        2003-06-01
8    D        2000-10-20
9    E        2001-11-23

I would like something like this as the end result:

 specimen        DOB       diff.dob
1    A        2000-05-10      1
2    A        2002-04-13      1
3    B        2001-05-12      1
4    C        2003-06-01      0
5    B        2003-04-21      1
6    D        2000-10-20      0
7    C        2003-06-01      0
8    D        2000-10-20      0
9    E        2001-11-23      0

Identifying duplicates is the easy part; I am just having trouble adding an extra column of 1's and 0's indicating whether the duplicates have different DOBs. Any help would be greatly appreciated. Thank you.


Solution

  • You can try ave

    test.df$diff.dob <-  with(test.df, ave(as.numeric(DOB), specimen,
                                  FUN=function(x) length(unique(x))!=1))
    

    Or using dplyr

    library(dplyr)
    test.df %>%
              group_by(specimen) %>%
               mutate(diff.dob=(n_distinct(DOB)!=1)+0)
    #    specimen        DOB diff.dob
    #1        A 2000-05-10        1
    #2        A 2002-04-13        1
    #3        B 2001-05-12        1
    #4        C 2003-06-01        0
    #5        B 2003-04-21        1
    #6        D 2000-10-20        0
    #7        C 2003-06-01        0
    #8        D 2000-10-20        0
    #9        E 2001-11-23        0
    

    Or using data.table

    library(data.table)
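    # uniqueN() counts distinct DOBs per specimen, so a group of 3+ samples with a partly repeated DOB is still flagged
    setDT(test.df)[, diff.dob := (uniqueN(DOB) > 1) + 0, by = specimen][]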
    

    Or another possible option with base R

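    # Count distinct DOBs within each specimen (not across the whole data frame),
    # then look the count up for every row.
    n.dob <- tapply(as.numeric(test.df$DOB), test.df$specimen,
                    function(x) length(unique(x)))
    as.vector(n.dob[as.character(test.df$specimen)] > 1) + 0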
    #[1] 1 1 1 0 1 0 0 0 0
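
The sample data never has more than two samples per specimen, so it may help to check the rule "more than one distinct DOB within a specimen" on a slightly harder case. The following is only a sketch: the helper name `flag_diff_dob` and the data frame `edge.df` are made up for illustration and are not part of the question or the original answer. It reuses the `ave` idea from above.

```r
# Hypothetical helper (not from the original answer): returns 1 for every row
# whose specimen has more than one distinct DOB, and 0 otherwise.
flag_diff_dob <- function(specimen, dob) {
  # ave() applies FUN within each specimen group and recycles the single
  # result back onto every row of that group.
  ave(as.numeric(dob), specimen,
      FUN = function(x) as.numeric(length(unique(x)) > 1))
}

# Made-up data: specimen "A" has three samples, two of which share a DOB;
# specimen "B" is a consistent duplicate; specimen "C" is unique.
edge.df <- data.frame(
  specimen = c("A", "A", "A", "B", "B", "C"),
  DOB = as.Date(c("2000-05-10", "2000-05-10", "2002-04-13",
                  "2001-05-12", "2001-05-12", "2003-06-01"))
)

edge.df$diff.dob <- flag_diff_dob(edge.df$specimen, edge.df$DOB)
edge.df
```

Here all three "A" rows get a 1, while the "B" and "C" rows get a 0. Counting distinct values per group, rather than looking for repeated values, is what keeps the partly repeated DOB in "A" from hiding the mismatch.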