I'm having trouble removing duplicates using either duplicated() or dplyr's distinct(). I don't know what the problem is; can anyone help? Here is a small part of the data as an example:
df <- data.frame(
  timestamp = c(1495115680.55608, 1495115680.58941, 1495115680.62274),
  id = c("2017-05-18-145157833880", "2017-05-18-145157833880",
         "2017-05-18-145157833880"),
  condition = c("childchild", "childchild", "childchild")
)
Both of these attempts fail to remove the duplicates:
library(dplyr)

# Attempt 1: filter on duplicated()
df %>%
  filter(!duplicated(timestamp))

# Attempt 2: distinct() on the timestamp column
distinct(df, timestamp, .keep_all = TRUE)
   timestamp                      id  condition
1 1495115681 2017-05-18-145157833880 childchild
2 1495115681 2017-05-18-145157833880 childchild
3 1495115681 2017-05-18-145157833880 childchild
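The problem is one of numeric precision: the timestamps are not exact duplicates. R prints only 7 significant digits by default, so the three values look identical, but the stored values differ in their fractional seconds (.55608, .58941, .62274). They are duplicates only after rounding to whole seconds.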
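To check whether the values really are duplicates, it may help to print them with more digits. A minimal sketch in base R, using the three timestamps from the question (the variable name timestamp is just for illustration):

```r
# The three timestamps from the question
timestamp <- c(1495115680.55608, 1495115680.58941, 1495115680.62274)

# Default printing uses 7 significant digits, so the values look identical
print(timestamp)

# Asking for more digits reveals the fractional seconds
print(timestamp, digits = 15)

# anyDuplicated() returns 0 when no element repeats an earlier one
anyDuplicated(timestamp)         # 0: no exact duplicates

# After rounding to whole seconds, the second value repeats the first
anyDuplicated(round(timestamp))  # 2: index of the first duplicate
```

If the second print() shows differing decimals, duplicated() and distinct() are behaving correctly and the data needs rounding before de-duplication.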
One way to solve this is to round the timestamps and then apply filter() or distinct():
df %>%
  mutate(timestamp1 = round(timestamp, 0)) %>%
  filter(!duplicated(timestamp1)) %>%
  select(-timestamp1)
   timestamp                      id  condition
1 1495115681 2017-05-18-145157833880 childchild
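The same rounding idea should also work with distinct(). A sketch, assuming dplyr is installed; ts_rounded is a hypothetical helper column name, and rounding to whole seconds is an assumption about what should count as a duplicate here:

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  timestamp = c(1495115680.55608, 1495115680.58941, 1495115680.62274),
  id = rep("2017-05-18-145157833880", 3),
  condition = rep("childchild", 3)
)

result <- df %>%
  mutate(ts_rounded = round(timestamp)) %>%  # helper column: whole seconds
  distinct(ts_rounded, .keep_all = TRUE) %>% # keep first row per rounded value
  select(-ts_rounded)                        # drop the helper column

nrow(result)  # 1
```

The surviving row keeps its original, unrounded timestamp. For a finer granularity, something like round(timestamp, 1) could be used instead, depending on how close two readings must be to count as the same event.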