Search code examples
rdataframedplyrsubset

Remove only offsetting values


I have data with possible duplicates; however, due to the way it is organized and the size of the data it is hard to tell quickly if it is a duplicate record or if it has been reversed. I'm trying to find a way to delete any record that has been reversed and keep all others, including possible duplicates. The relevant data resembles as follows:

 Date       ID         Amount
1/1/2023   1234        $100.00
1/1/2023   1234        $100.00
1/1/2023   1234        -$100.00
1/3/2023   2345        $150.00
1/5/2023   3456        $200.00
1/6/2023   4444        $300.00
1/6/2023   4444        $300.00
1/6/2023   4444        $300.00
1/6/2023   4444        -$300.00

In this scenario I would keep one record for ID 1234, ID 2345, ID 3456, and two records of ID 4444 (including one duplicate). I'm hitting my head against the wall. I've tried various iterations of the code below but the lag/lead option doesn't seem to work for me.

library(dplyr)

df%>%
    group_by(Date, ID) %>%
    mutate(diff = VALUE + lag(VALUE),
           diff2 = VALUE + lead(VALUE)) %>%
    mutate_at(vars(diff:diff2), funs(coalesce(., 1))) %>%
    filter((diff !=0 & diff2 !=0))  %>%
    select(-diff, -diff2)

Solution

  • Assuming that a reversal must be consecutive (within a particular group, defined by Date and ID), I think this works:

    df %>%
      mutate(amt = sub("^-", "", Amount)) %>%
      group_by(Date, ID, amt) %>%
      mutate(
        reversed = grepl("^-", Amount) & !grepl("^-", lag(Amount)), 
        reversed = reversed | lead(reversed, default = FALSE)
      ) %>%
      filter(!reversed) %>%
      ungroup()
    # # A tibble: 5 × 5TRUE
    #   Date        ID Amount  amt     reversedTRUE
    #   <chr>    <int> <chr>   <chr>   <lgl>   TRUE
    # 1 1/1/2023  1234 $100.00 $100.00 FALSE   TRUE
    # 2 1/3/2023  2345 $150.00 $150.00 FALSE   TRUE
    # 3 1/5/2023  3456 $200.00 $200.00 FALSE   TRUE
    # 4 1/6/2023  4444 $300.00 $300.00 FALSE   TRUE
    # 5 1/6/2023  4444 $300.00 $300.00 FALSE   TRUE