R duplicated rows still remain after distinct


I am trying to remove duplicated rows from my data frame, but neither distinct(d) nor filter(!duplicated(d)) removes them (where d is the data frame containing the duplicated rows); the functions do not recognize the rows as duplicates. Is there a common reason why this happens?

Below is the example dataset using dput.

d <- structure(list(id.case = c("114746", "114746", "114746", "114746", 
"114746", "114746", "114746", "114746", "114746", "114746", "114746", 
"114746", "114746", "114746", "114746", "114746", "114746", "114746", 
"114746", "114746"), id.pair = c("78272-10794", "9330-10794", 
"9330-10794", "80739-42071", "80739-42071", "42114-10794", "42114-10794", 
"84701-42114", "84701-42114", "5533-42071", "5533-42071", "8876-5533", 
"8876-5533", "5652-42114", "5652-42114", "8920-5652", "8920-5652", 
"78272-5533", "78272-5533", "9114-78272"), e1.conditional.dyad = c(1.07224025692901, 
0.568380969299369, 0.568380969302098, 0.252545406662165, 0.252545406663273, 
-1.21808723071715, -1.21808723071797, -4.1477891182987, -4.14778911829956, 
-1.48315629665277, -1.48315629665359, -1.3047217588809, -1.30472175888309, 
-1.63547814316539, -1.63547814316453, -0.671008645771849, -0.671008645772957, 
-0.0801843233972761, -0.0801843233964519, 2.30874742062369)), row.names = c(NA, 
20L), class = "data.frame")

I am trying to use the below code.

library(dplyr)
d %>% distinct()

Solution

  • Up front: your numbers are not exactly the same. Rows 2 and 3, for example, print identically but differ by about 2.7e-12:

    d[2:3,]
    #   id.case    id.pair e1.conditional.dyad
    # 2  114746 9330-10794            0.568381
    # 3  114746 9330-10794            0.568381
    diff(d[2:3,3])
    # [1] 2.729039e-12
    

    Computers store floating-point numbers (aka double, numeric, float) in a fixed number of binary digits, so most non-integer values are only approximated, and two calculations that should give the same result can differ in the last few digits. This is not specific to R or to any one programming language. Arbitrary-precision add-on libraries or packages exist, but I believe most mainstream languages (this is relative/subjective, I admit) do not use them by default. R also prints only 7 significant digits by default, which is why rows 2 and 3 look identical in the output above even though distinct compares the full stored values. Refs: Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754
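
    As a small illustration of the point above, the sketch below compares the two values from rows 2 and 3 in base R. The 1e-8 threshold in the last line is an assumed tolerance chosen for this example, not something the data dictates.

    ```r
    # The e1.conditional.dyad values from rows 2 and 3 of the sample data
    x <- 0.568380969299369
    y <- 0.568380969302098

    x == y                        # FALSE: exact comparison sees the difference
    print(c(x, y), digits = 15)   # printing more digits reveals the difference
    isTRUE(all.equal(x, y))       # TRUE: equal within all.equal's default tolerance
    abs(x - y) < 1e-8             # TRUE: equal within an explicit (assumed) tolerance

    # The same effect with textbook values
    0.1 + 0.2 == 0.3              # FALSE
    ```

    Exact comparison is what distinct and duplicated rely on, so values like these count as different rows.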

    To keep using distinct without changing the stored values, de-duplicate on a rounded copy of the column and then drop that helper column:

    d %>%
      distinct(id.case, id.pair, ign = round(e1.conditional.dyad, 8), .keep_all = TRUE) %>%
      select(-ign)
    #    id.case     id.pair e1.conditional.dyad
    # 1   114746 78272-10794          1.07224026
    # 2   114746  9330-10794          0.56838097
    # 3   114746 80739-42071          0.25254541
    # 4   114746 42114-10794         -1.21808723
    # 5   114746 84701-42114         -4.14778912
    # 6   114746  5533-42071         -1.48315630
    # 7   114746   8876-5533         -1.30472176
    # 8   114746  5652-42114         -1.63547814
    # 9   114746   8920-5652         -0.67100865
    # 10  114746  78272-5533         -0.08018432
    # 11  114746  9114-78272          2.30874742
    

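    The choice of 8 digits is arbitrary here and depends on what you know about your data: it should be coarse enough to absorb the floating-point noise (around 1e-12 in this sample) yet fine enough that genuinely different values stay distinct. Because .keep_all = TRUE keeps the first row of each group, the e1.conditional.dyad values that remain are the original, unrounded ones.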
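
    To see how sensitive the result is to the number of digits, here is a minimal sketch on a three-row toy data frame. The names toy, value, and n_distinct_rows are hypothetical and exist only for this illustration; the first two values are taken from rows 2 and 3 of the sample data.

    ```r
    library(dplyr)

    # Hypothetical toy data: two near-duplicate rows and one clearly different row
    toy <- data.frame(
      id.pair = c("9330-10794", "9330-10794", "9114-78272"),
      value   = c(0.568380969299369, 0.568380969302098, 2.30874742062369)
    )

    # Illustrative helper (not part of dplyr): count the rows that survive
    # de-duplication when `value` is rounded to `digits` decimal places
    n_distinct_rows <- function(tbl, digits) {
      tbl %>%
        distinct(id.pair, ign = round(value, digits), .keep_all = TRUE) %>%
        select(-ign) %>%
        nrow()
    }

    n_distinct_rows(toy, 8)    # 2: the near-duplicates collapse into one row
    n_distinct_rows(toy, 12)   # 3: the tiny difference survives the rounding
    ```

    Rounding too finely leaves the duplicates in place, while rounding too coarsely could merge rows that really are different, so it may be worth checking a few values of digits against data you understand.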