I am trying to remove duplicated rows in my data frame, but neither distinct(d) nor filter(duplicated(d)) removes them (where d is the name of the data frame with duplicated rows); the functions do not recognize the duplicated rows. Is there a common reason why this happens?
Below is the example dataset, produced with dput.
structure(list(id.case = c("114746", "114746", "114746", "114746",
"114746", "114746", "114746", "114746", "114746", "114746", "114746",
"114746", "114746", "114746", "114746", "114746", "114746", "114746",
"114746", "114746"), id.pair = c("78272-10794", "9330-10794",
"9330-10794", "80739-42071", "80739-42071", "42114-10794", "42114-10794",
"84701-42114", "84701-42114", "5533-42071", "5533-42071", "8876-5533",
"8876-5533", "5652-42114", "5652-42114", "8920-5652", "8920-5652",
"78272-5533", "78272-5533", "9114-78272"), e1.conditional.dyad = c(1.07224025692901,
0.568380969299369, 0.568380969302098, 0.252545406662165, 0.252545406663273,
-1.21808723071715, -1.21808723071797, -4.1477891182987, -4.14778911829956,
-1.48315629665277, -1.48315629665359, -1.3047217588809, -1.30472175888309,
-1.63547814316539, -1.63547814316453, -0.671008645771849, -0.671008645772957,
-0.0801843233972761, -0.0801843233964519, 2.30874742062369)), row.names = c(NA,
20L), class = "data.frame")
I am trying to use the code below.
d %>% distinct
Up front: your numbers are not exactly the same, see
d[2:3,]
# id.case id.pair e1.conditional.dyad
# 2 114746 9330-10794 0.568381
# 3 114746 9330-10794 0.568381
diff(d[2:3,3])
# [1] 2.729039e-12
Computers have limitations when it comes to floating-point numbers (aka double, numeric, float). This is a fundamental limitation of how computers represent non-integer numbers, and it is not specific to any one programming language. Some add-on libraries or packages are much better at arbitrary-precision math, but I believe most mainstream languages (this is relative/subjective, I admit) do not use them by default. Refs: Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754
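As a minimal sketch of the point above, the two values below are copied from rows 2 and 3 of the example data. The names x and y are only for illustration. Exact comparison treats them as different, while a tolerance-based comparison or rounding treats them as the same:

```r
# Two values taken from rows 2 and 3 of the example data
x <- 0.568380969299369
y <- 0.568380969302098

x == y                      # exact comparison: FALSE
y - x                       # tiny but nonzero difference
isTRUE(all.equal(x, y))     # comparison within a tolerance: TRUE
round(x, 8) == round(y, 8)  # identical once rounded to 8 digits: TRUE
```

Functions such as distinct and duplicated compare values exactly, which is why they see these rows as different.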
To continue using distinct without losing the actual precision of your values, try
d %>%
  distinct(id.case, id.pair, ign = round(e1.conditional.dyad, 8), .keep_all = TRUE) %>%
  select(-ign)
# id.case id.pair e1.conditional.dyad
# 1 114746 78272-10794 1.07224026
# 2 114746 9330-10794 0.56838097
# 3 114746 80739-42071 0.25254541
# 4 114746 42114-10794 -1.21808723
# 5 114746 84701-42114 -4.14778912
# 6 114746 5533-42071 -1.48315630
# 7 114746 8876-5533 -1.30472176
# 8 114746 5652-42114 -1.63547814
# 9 114746 8920-5652 -0.67100865
# 10 114746 78272-5533 -0.08018432
# 11 114746 9114-78272 2.30874742
where the choice of 8 digits is arbitrary here; the right number depends on what you know about the data, in particular how small a difference between two values is still meaningful.
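If you would rather not depend on dplyr, the same idea can be sketched in base R: build a comparison key with the numeric column rounded, then use duplicated on the key while keeping the original rows. The data frame d_small below is a hypothetical three-row subset of the example data, and key and deduped are illustrative names, not part of the original code.

```r
# Hypothetical subset: the first three rows of the example data
d_small <- data.frame(
  id.case = c("114746", "114746", "114746"),
  id.pair = c("78272-10794", "9330-10794", "9330-10794"),
  e1.conditional.dyad = c(1.07224025692901, 0.568380969299369, 0.568380969302098)
)

# Comparison key: identifiers plus the numeric column rounded to 8 digits
key <- data.frame(
  id.case = d_small$id.case,
  id.pair = d_small$id.pair,
  value   = round(d_small$e1.conditional.dyad, 8)
)

# Keep the first row of each group of near-duplicates; the kept rows
# retain their original, unrounded values
deduped <- d_small[!duplicated(key), ]
deduped
```

One caveat with any rounding-based key: two values that are extremely close but fall on opposite sides of a rounding boundary will still round differently, so this is a pragmatic fix rather than a guaranteed one.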