
Duplicated not returning expected result


Suppose I have the data.table below:

library(data.table)

df <- data.table(id = c(1, 2, 2, 3),
                 datee = as.Date(c('2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03')))
df

   id      datee
1:  1 2022-01-01
2:  2 2022-01-02
3:  2 2022-01-02
4:  3 2022-01-03

and I want to keep only the non-duplicated rows:

df[!duplicated(id, datee)]

    id      datee
1:  1 2022-01-01
2:  2 2022-01-02
3:  3 2022-01-03

This worked. However, with df_1 below

df_1 <- data.table(a = c(1, 1, 2),
                   b = c(1, 1, 3))
df_1

   a b
1: 1 1
2: 1 1
3: 2 3

the same method does not remove the duplicated rows:

df_1[!duplicated(a, b)]

   a b
1: 1 1
2: 1 1
3: 2 3

What am I doing wrong?


Solution

  • Let's dive into why df_1[!duplicated(a, b)] doesn't work.

    duplicated uses S3 method dispatch.

    library(data.table)
    
    .S3methods("duplicated")
    # [1] duplicated.array           duplicated.data.frame     
    # [3] duplicated.data.table*     duplicated.default        
    # [5] duplicated.matrix          duplicated.numeric_version
    # [7] duplicated.POSIXlt         duplicated.warnings       
    # see '?methods' for accessing help and source code
    

    Of those, duplicated.data.table is not the method in play: S3 dispatches on the class of the first argument, and we are calling duplicated with individual vectors (it has no idea it is being called from within a data.table context). So it makes sense to look into duplicated.default.
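
A minimal sketch of the dispatch difference, reusing df_1 from the question. The names dup_rows and dup_vecs are only illustrative:

```r
library(data.table)

df_1 <- data.table(a = c(1, 1, 2), b = c(1, 1, 3))

# Whole table as first argument: dispatches to duplicated.data.table,
# which compares entire rows.
dup_rows <- duplicated(df_1)

# Bare vectors: dispatch is on the class of the first argument (numeric),
# so duplicated.default runs and `b` lands in its second parameter.
dup_vecs <- duplicated(df_1$a, df_1$b)

print(dup_rows)  # FALSE  TRUE FALSE
print(dup_vecs)  # FALSE FALSE FALSE
```

Inside df_1[...], the symbols a and b resolve to those same plain column vectors, which is why the call in the question behaves like the second form.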

    > debugonce(duplicated.default)
    > df_1[!duplicated(a, b)]
    debugging in: duplicated.default(a, b)
    debug: .Internal(duplicated(x, incomparables, fromLast, if (is.factor(x)) min(length(x), 
        nlevels(x) + 1L) else nmax))
    Browse[2]> match.call()                           # ~ "how this function was called"
    duplicated.default(x = a, incomparables = b)
    

    Confirming with ?duplicated:

           x: a vector or a data frame or an array or 'NULL'.
    
    incomparables: a vector of values that cannot be compared.  'FALSE' is
              a special value, meaning that all values can be compared, and
              may be the only value accepted for methods other than the
              default.  It will be coerced internally to the same type as
              'x'.
    

    From this we can see that a is the vector being deduplicated, and b is passed as incomparables. Because b contains the value 1, which is also the duplicated value in a, the elements where a == 1 are treated as incomparable and are never flagged as duplicates.
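
The effect of incomparables can be seen on plain vectors, without data.table involved at all. The sketch below also suggests why the first example in the question appeared to work. This is my reading rather than something stated in the documentation quoted above: Date values are stored as day counts, so as incomparables they become large numbers that match none of the ids.

```r
a <- c(1, 1, 2)

plain  <- duplicated(a)                     # second 1 is flagged
masked <- duplicated(a, incomparables = 1)  # every 1 is skipped

print(plain)   # FALSE  TRUE FALSE
print(masked)  # FALSE FALSE FALSE

# First example from the question: `datee` ends up as incomparables.
id    <- c(1, 2, 2, 3)
datee <- as.Date(c("2022-01-01", "2022-01-02", "2022-01-02", "2022-01-03"))

# No id equals a date's day count, so this deduplicates on `id` alone.
by_id_only <- duplicated(id, incomparables = datee)
print(by_id_only)  # FALSE FALSE  TRUE FALSE
```

So the first result was right only because deduplicating on id alone happened to give the same answer as deduplicating on both columns.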

    To confirm, if we change b such that it does not share (duplicated) values with a, we see that the deduplication of a works as intended (though it is silently ignoring b's dupes due to the argument problem):

    df_1 <- data.table(a = c(1,1,2) , b = c(2,2,4))
    df_1[!duplicated(a, b)]                  # accidentally correct, `b` is not used for deduplication
    #        a     b
    #    <num> <num>
    # 1:     1     2
    # 2:     2     4
    unique(df_1, by = c("a", "b"))
    #        a     b
    #    <num> <num>
    # 1:     1     2
    # 2:     2     4
    
    
    df_2 <- data.table(a = c(1,1,2) , b = c(2,3,4))
    df_2[!duplicated(a, b)]                  # wrong, `b` is not considered
    #        a     b
    #    <num> <num>
    # 1:     1     2
    # 2:     2     4
    unique(df_2, by = c("a", "b"))
    #        a     b
    #    <num> <num>
    # 1:     1     2
    # 2:     1     3
    # 3:     2     4
    

    (Note that unique above is actually data.table:::unique.data.table, another S3 method provided by the data.table package.)
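
If you would rather keep the !duplicated(...) form than switch to unique, one option is to pass the table itself so that the data.table method is dispatched, and name the columns with by. A short sketch (keep_1, keep_2 and same_1 are illustrative names):

```r
library(data.table)

df_1 <- data.table(a = c(1, 1, 2), b = c(1, 1, 3))
df_2 <- data.table(a = c(1, 1, 2), b = c(2, 3, 4))

# First argument is a data.table, so duplicated.data.table is used
# and rows are compared on the columns listed in `by`.
keep_1 <- df_1[!duplicated(df_1, by = c("a", "b"))]
keep_2 <- df_2[!duplicated(df_2, by = c("a", "b"))]

# Equivalent result via unique() for comparison.
same_1 <- unique(df_1, by = c("a", "b"))

print(keep_1)
print(keep_2)
```

Here keep_1 has two rows (the repeated 1, 1 row is dropped) and keep_2 keeps all three rows, since no (a, b) pair repeats.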

    debug and debugonce are your friends :-)
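
For a quick non-interactive check of how positional arguments will be matched, match.call can also be given a function definition and an unevaluated call directly. This is a sketch of an alternative to the debugger session above, not a replacement for it:

```r
# Match the call `duplicated(a, b)` against the default method's formals.
mc <- match.call(duplicated.default, quote(duplicated(a, b)))
print(mc)

# The formal argument names, in positional order.
arg_names <- names(formals(duplicated.default))
print(arg_names)
```

The printed call shows b bound to incomparables, the same conclusion the debugonce session reached.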