I understand that my question title is quite wordy and may be difficult to understand so I'll try my best to explain it here.
Here is some example data:
transactions <- tibble(id = seq(1:7),
day = paste(rep("day", each = 7), seq(1:7), sep = ""),
sent_to = c(NA, "Garden Cinema", "Pasta House", NA, "Blue Superstore", "Jane", "Joe"),
received_from = c("ATM", NA, NA, "Sarah", NA, NA, NA),
reference = c("add_cash", "cinema_tickets", "meal", "gift", "shopping", "reimbursed", "reimbursed"),
decrease = c(NA, 10.8, 12.5, NA, 15.25, NA, NA),
increase = c(50, NA, NA, 30, NA, 5.40, 7.25),
reimbursed_id = c(NA, "R", "R", NA, NA, 2, 3))
# # A tibble: 7 × 7
# id day sent_to received_from reference decrease increase reimbursed_id
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
# 1 1 day1 NA ATM add_cash NA 50 NA
# 2 2 day2 Garden Cinema NA cinema_tickets 10.8 NA R
# 3 3 day3 Pasta House NA meal 12.5 NA R
# 4 4 day4 NA Sarah gift NA 30 NA
# 5 5 day5 Blue Superstore NA shopping 15.2 NA NA
# 6 6 day6 Jane NA reimbursed NA 5.4 2
# 7 7 day7 Joe NA reimbursed NA 7.25 3
Explanation for reimbursed_id
column:
R
indicates the value in the decrease column is not representative of the user's actual spending because it includes the amount paid on someone's behalf2
(or any number) represents the id
for which the user was reimbursed (returned the borrowed amount)Desired outcome:
I would like to add an "actual_decrease" column to this dataset which essentially looks through the reimbursed_id
column for numerical values, collects the reimbursed amount in the increase
column and subtracts it from the values in the decrease
for the respective id
's.
I'm not sure how to better explain this, so here is what the desired result looks like:
# # A tibble: 7 × 9
# id day sent_to received_from reference decrease increase reimbursed_id actual_decrease
# <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
# 1 1 day1 NA ATM add_cash NA 50 NA NA
# 2 2 day2 Garden Cinema NA cinema_tickets 10.8 NA R 5.4
# 3 3 day3 Pasta House NA meal 12.5 NA R 5.25
# 4 4 day4 NA Sarah gift NA 30 NA NA
# 5 5 day5 Blue Superstore NA shopping 15.2 NA NA 15.2
# 6 6 day6 Jane NA reimbursed NA 5.4 2 NA
# 7 7 day7 Joe NA reimbursed NA 7.25 3 NA
I can't think of any functions that could help solve this besides a really complicated ifelse()
statement, and maybe a for loop (but I'd prefer to avoid loops).
Note that it is also preferred that your answer doesn't involve the row numbers, instead the id
column, because for my actual dataset these values do not correspond.
Any input is appreciated :)
One option to approach this issue would be via a join. Basically I first create a helper df by filtering for the "reimbursed" transactions which could then be joined to the original df to finally compute the actual_decrease
:
library(dplyr, warn = FALSE)
library(tidyr)
transactions_help <- transactions |>
filter(reference == "reimbursed") |>
select(reimbursed_decrease = increase, reimbursed_id) |>
mutate(reimbursed_id = as.numeric(reimbursed_id))
transactions <- transactions |>
left_join(transactions_help, by = c("id" = "reimbursed_id")) |>
replace_na(list(reimbursed_decrease = 0)) |>
mutate(actual_decrease = decrease - reimbursed_decrease)
transactions
#> # A tibble: 7 × 10
#> id day sent_to recei…¹ refer…² decre…³ incre…⁴ reimb…⁵ reimb…⁶ actua…⁷
#> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 day1 <NA> ATM add_ca… NA 50 <NA> 0 NA
#> 2 2 day2 Garden Ci… <NA> cinema… 10.8 NA R 5.4 5.4
#> 3 3 day3 Pasta Hou… <NA> meal 12.5 NA R 7.25 5.25
#> 4 4 day4 <NA> Sarah gift NA 30 <NA> 0 NA
#> 5 5 day5 Blue Supe… <NA> shoppi… 15.2 NA <NA> 0 15.2
#> 6 6 day6 Jane <NA> reimbu… NA 5.4 2 0 NA
#> 7 7 day7 Joe <NA> reimbu… NA 7.25 3 0 NA
#> # … with abbreviated variable names ¹received_from, ²reference, ³decrease,
#> # ⁴increase, ⁵reimbursed_id, ⁶reimbursed_decrease, ⁷actual_decrease