
Trying to find all duplicates, but by group in R


I am trying to find the duplicates, but based on a grouping. The grouping variable I want to use is called MRN (i.e. BMIdf$MRN). In other words, I want to find the duplicates, but only if it is a duplicate for the specific MRN id. I am not sure how to incorporate that grouping into my syntax. Here is what I have so far.

BMIdf$dupobs <- duplicated(BMIdf$OBSERVATION_DATE) |
  duplicated(BMIdf$OBSERVATION_DATE, fromLast = TRUE)

How can I return TRUE only if it is a duplicate for a given MRN id? I am open to non-data.table methods.

Here is some sample data:

library(anytime)  # provides anydate()
sample <- data.frame(MRN = c(1, 2, 1, 2, 3, 4, 3),
                     OBSERVATION_DATE = anydate(c("2013-02-19", "2013-02-28", "2013-02-19", "2013-02-28", "2013-02-28", "2013-03-08", "2014-01-06")))

So I want it to recognize the 2nd and 4th dates in the vector as duplicates (and likewise the 1st and 3rd), but not the 5th, because the 5th has the same date under a different MRN id.


Solution

  • data.table

    library(data.table)
    as.data.table(sample)[, dupobs := any(duplicated(.SD)), by = MRN][]
    #      MRN OBSERVATION_DATE dupobs
    #    <num>           <Date> <lgcl>
    # 1:     1       2013-02-19   TRUE
    # 2:     2       2013-02-28   TRUE
    # 3:     1       2013-02-19   TRUE
    # 4:     2       2013-02-28   TRUE
    # 5:     3       2013-02-28  FALSE
    # 6:     4       2013-03-08  FALSE
    # 7:     3       2014-01-06  FALSE
    

  • dplyr

    library(dplyr)
    sample %>%
      group_by(MRN) %>%
      mutate(dupobs = any(duplicated(OBSERVATION_DATE))) %>%
      ungroup()
    # # A tibble: 7 x 3
    #     MRN OBSERVATION_DATE dupobs
    #   <dbl> <date>           <lgl> 
    # 1     1 2013-02-19       TRUE  
    # 2     2 2013-02-28       TRUE  
    # 3     1 2013-02-19       TRUE  
    # 4     2 2013-02-28       TRUE  
    # 5     3 2013-02-28       FALSE 
    # 6     4 2013-03-08       FALSE 
    # 7     3 2014-01-06       FALSE 
    

  • base R

    sample$dupobs <- ave(as.integer(sample$OBSERVATION_DATE), sample$MRN,
                         FUN = function(z) any(duplicated(z))) > 0
    sample
    #   MRN OBSERVATION_DATE dupobs
    # 1   1       2013-02-19   TRUE
    # 2   2       2013-02-28   TRUE
    # 3   1       2013-02-19   TRUE
    # 4   2       2013-02-28   TRUE
    # 5   3       2013-02-28  FALSE
    # 6   4       2013-03-08  FALSE
    # 7   3       2014-01-06  FALSE
    

    With ave, the output takes the class of the first argument, which can be rather inconvenient. Here I cast the dates to integer so that the function runs without error. The inner function returns a logical, but ave coerces it to the class of the input vector (integer), which turns FALSE into 0 and TRUE into 1. Comparing that output (0s and 1s) against 0 then recovers a logical. A minor inconvenience.
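The coercion described above can be seen in isolation with a small sketch. The vectors `x` and `g` are hypothetical toy data, not part of the question; only base R is assumed.

```r
# Toy data (hypothetical): two groups, one value repeated in group "a".
x <- c(10L, 10L, 20L)
g <- c("a", "a", "b")

# The inner function returns one logical per group, but ave() writes the
# result back into a copy of x, so it is stored as integer 0/1.
flag_int <- ave(x, g, FUN = function(z) any(duplicated(z)))
class(flag_int)   # "integer"

# Comparing against 0 recovers a logical vector.
flag <- flag_int > 0
flag              # TRUE TRUE FALSE
```

Passing the Date column directly as the first argument would make ave try to store logicals in a Date vector, which is why the answer casts to integer first.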


    Data

    sample <- structure(list(MRN = c(1, 2, 1, 2, 3, 4, 3), OBSERVATION_DATE = structure(c(15755, 15764, 15755, 15764, 15764, 15772, 16076), class = "Date")), class = "data.frame", row.names = c(NA, -7L))
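
One caveat worth noting: `any(duplicated(...))` within each MRN flags every row of that MRN as soon as one date repeats, including rows whose own date is unique. For the sample data this makes no difference, but if row-level flags are wanted, a minimal base R sketch could look like the following. The data frame `sample2` and its extra eighth row are assumptions added purely for illustration.

```r
# Sample data from the question plus one hypothetical extra row
# (MRN 1 with a date that occurs only once).
sample2 <- data.frame(
  MRN = c(1, 2, 1, 2, 3, 4, 3, 1),
  OBSERVATION_DATE = as.Date(c("2013-02-19", "2013-02-28", "2013-02-19",
                               "2013-02-28", "2013-02-28", "2013-03-08",
                               "2014-01-06", "2015-05-05"))
)

# Row-level flag: TRUE only when this exact MRN/date pair occurs more than once.
key <- sample2[c("MRN", "OBSERVATION_DATE")]
sample2$dup_row <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Group-level flag, as in the base R solution above, for comparison.
sample2$dup_group <- ave(as.integer(sample2$OBSERVATION_DATE), sample2$MRN,
                         FUN = function(z) any(duplicated(z))) > 0

sample2
```

In this sketch row 8 gets `dup_group = TRUE` (its MRN has a repeated date elsewhere) but `dup_row = FALSE` (its own date is unique), so the choice depends on which of the two meanings of "duplicate" is intended.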