I have data with possible duplicates; however, due to the way it is organized and the size of the data it is hard to tell quickly if it is a duplicate record or if it has been reversed. I'm trying to find a way to delete any record that has been reversed and keep all others, including possible duplicates. The relevant data resembles as follows:
Date ID Amount
1/1/2023 1234 $100.00
1/1/2023 1234 $100.00
1/1/2023 1234 -$100.00
1/3/2023 2345 $150.00
1/5/2023 3456 $200.00
1/6/2023 4444 $300.00
1/6/2023 4444 $300.00
1/6/2023 4444 $300.00
1/6/2023 4444 -$300.00
In this scenario I would keep one record for ID 1234, ID 2345, ID 3456, and two records of ID 4444 (including one duplicate). I'm hitting my head against the wall. I've tried various iterations of the code below but the lag/lead option doesn't seem to work for me.
library(dplyr)
df%>%
group_by(Date, ID) %>%
mutate(diff = VALUE + lag(VALUE),
diff2 = VALUE + lead(VALUE)) %>%
mutate_at(vars(diff:diff2), funs(coalesce(., 1))) %>%
filter((diff !=0 & diff2 !=0)) %>%
select(-diff, -diff2)
Assuming that a reversal must be consecutive (within a particular group, defined by Date
and ID
), I think this works:
df %>%
mutate(amt = sub("^-", "", Amount)) %>%
group_by(Date, ID, amt) %>%
mutate(
reversed = grepl("^-", Amount) & !grepl("^-", lag(Amount)),
reversed = reversed | lead(reversed, default = FALSE)
) %>%
filter(!reversed) %>%
ungroup()
# # A tibble: 5 × 5TRUE
# Date ID Amount amt reversedTRUE
# <chr> <int> <chr> <chr> <lgl> TRUE
# 1 1/1/2023 1234 $100.00 $100.00 FALSE TRUE
# 2 1/3/2023 2345 $150.00 $150.00 FALSE TRUE
# 3 1/5/2023 3456 $200.00 $200.00 FALSE TRUE
# 4 1/6/2023 4444 $300.00 $300.00 FALSE TRUE
# 5 1/6/2023 4444 $300.00 $300.00 FALSE TRUE