Search code examples
rdataframedplyrdata-cleaning

Remove reversal transaction


I have transaction level data with some reversal transactions. These transactions are denoted by a negative amount and then a counterpart positive.

trnx_df <- data.frame(Date = c("2018-01-01", "2018-01-01", "2018-01-01", "2018-01-01", "2018-01-03", "2018-01-03", "2018-01-05", "2018-02-01",
                            "2018-02-01", "2018-02-01"),
                   Product = c("A", "A", "A", "A", "B", "B", "B", "A", "A", "A"),
                   Amount = c(-1000, 1000, 1000, 1000, -1000, 1000, 500, -2000, 1000, 2000))

trnx_df

             Date Product Amount
    1  2018-01-01       A  -1000
    2  2018-01-01       A   1000
    3  2018-01-01       A   1000
    4  2018-01-01       A   1000
    5  2018-01-03       B  -1000
    6  2018-01-03       B   1000
    7  2018-01-05       B    500
    8  2018-02-01       A  -2000
    9  2018-02-01       A   1000
    10 2018-02-01       A   2000

I want to arrive at total amount and maximum amount spent by that customer on particular product.

By using dplyr I arrive at:

library(dplyr)

trnx_summary <- trnx_df %>%
group_by(Product) %>%
summarize(Total_amount = sum(Amount),
        Max_amount = max(Amount))

trnx_summary
  Product Total_amount Max_amount
1       A         3000       2000
2       B          500       1000

For total there will be no problem as the negative entry will cancel out positive one but for maximum amount spent I will get wrong output.

The maximum amount for Product A should be 1000 (2000 and -2000 will cancel each other out).

How can I fix this? Also, is there a way to delete these reversal transactions from the dataframe itself?


Solution

  • df %>% #filter the negative transactions, save in dftemp
      filter(Amount < 0) %>% 
      mutate(Amount = abs(Amount)) -> dftemp # in dftemp, negative transactions are positive to ease looking for matches
    
    df %>%  #filter the positive transactions that do no have a negative duplicate
      filter(Amount > 0) %>% 
      anti_join(dftemp) -> dfuniques  
    
    df %>% 
      filter(Amount > 0) %>% #filter positive transactions
      inner_join(dftemp) %>% #merge obs that are both in the original df and in dftemp 
      group_by(Date, Product, Amount) %>%  #group by date, product and amount
      slice(-1) %>% #for each date, product & amount combo, delete 1 row (which is a duplicate of one negative and one positive transaction)
      full_join(dfuniques) %>% # join the unique positive transactions (from here on, you have your desired dataframe with negative and positive transactions that cancelled each other out deleted)
      group_by(Product) %>% 
      summarise(Total_Amount = sum(Amount), Max_Amount = max(Amount))
    
      Product Total_Amount Max_Amount
       <fctr>        <dbl>      <dbl>
    1       A         3000       1000
    2       B          500        500