Search code examples
rdata.tableunique

Unique in data.table dropped some value by mistake


I need to remove duplicates from a big data frame that has 100 million rows. I am testing if data.table can help me on that. However, in the following code, unique() in data.table did not generate the same result as the unique() for data.frame. Is there a possible bug in setkey in data.table?

library(data.table)
tmp <- data.frame(id=c(1000000128152, 1000000228976, 1000000235508, 1000000294933, 1000000311288, 1000000353770, 1000000441585, 1000000466482, 1000000473521, 
                         1000000491353, 1000000497787, 1000000534948, 1000000589071, 1000000622890, 1000000658287, 1000000695865, 1000000731674, 1000000780659, 
                         1000000818218, 1000000834389, 1000000877189, 1000000937770, 1000000937770, 1000000996135, 1000001061831, 1000001062057, 1000001065241, 
                         1000001097542, 1000001122242, 1000001177167, 1000001194078, 1000001216323, 1000001232155, 1000001294998, 1000001361126, 1000001361126, 
                         1000001389830, 1000001411284, 1000001415793, 1000001417557, 1000001485326, 1000001565513, 1000001624601, 1000001650282, 1000001681805, 
                         1000001683548, 1000001683548, 1000001693445, 1000001693455, 1000001693462, 1000001693466, 1000001693490, 1000001693490, 1000001703493, 
                         1000001703511, 1000001703518, 1000001703546, 1000001703554, 1000001703613, 1000001703644))
unique(tmp$id)
DT <- data.table(tmp)
setkey(DT, id)
DTU <- unique(DT)
DTU$id

Results from the unique(tmp$id):
 [1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693466
[49] 1000001693490 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644

Result from DTU$id:
 [1] 1000000128152 1000000228976 1000000235508 1000000294933 1000000311288 1000000353770 1000000441585 1000000466482 1000000473521 1000000491353 1000000497787 1000000534948
[13] 1000000589071 1000000622890 1000000658287 1000000695865 1000000731674 1000000780659 1000000818218 1000000834389 1000000877189 1000000937770 1000000996135 1000001061831
[25] 1000001062057 1000001065241 1000001097542 1000001122242 1000001177167 1000001194078 1000001216323 1000001232155 1000001294998 1000001361126 1000001389830 1000001411284
[37] 1000001415793 1000001417557 1000001485326 1000001565513 1000001624601 1000001650282 1000001681805 1000001683548 1000001693445 1000001693455 1000001693462 1000001693490
[49] 1000001703493 1000001703511 1000001703518 1000001703546 1000001703554 1000001703613 1000001703644

Comparing the two, we see that 1000001693466 got dropped in DTU by mistake. Any suggestions on why? I suspect it's the setkey because when I subtracted the 1000000000000 from all the numbers, the result is the same.


Solution

  • Edit (from Arun): The default rounding feature has been removed in the current development version of data.table, v1.9.7, and is likely stay that way moving forward. See here for installation instructions.

    This also means that you're fully responsible for understanding limitations in representing floating point numbers and dealing with them :-).


    help(setkey) says (data.table version 1.9.6):

    Note that columns of numeric types (i.e., double) have their last two bytes rounded off while computing order, by default, to avoid any unexpected behaviour due to limitations in representing floating point numbers precisely. Have a look at setNumericRounding to learn more.

    By changing rounding to 1 byte before keying

    DT <- data.table(tmp)
    setNumericRounding(1)   # set rounding
    setkey(DT, id) 
    

    the value no longer will be dropped.

    However, help(setNumericRounding) says

    For large numbers (integers > 2^31), we recommend using bit64::integer64 rather than setting rounding to 0.