Search code examples
rdataframedplyrtibblemutate

How to conditionally select a column, and subtract values in those rows from rows in another conditionally selected column in R?


I understand that my question title is quite wordy and may be difficult to understand so I'll try my best to explain it here.

Here is some example data:

transactions <- tibble(id = seq(1:7),
                       day = paste(rep("day", each = 7), seq(1:7), sep = ""),
                       sent_to = c(NA, "Garden Cinema", "Pasta House", NA, "Blue Superstore", "Jane", "Joe"),
                       received_from = c("ATM", NA, NA, "Sarah", NA, NA, NA),
                       reference = c("add_cash", "cinema_tickets", "meal", "gift", "shopping", "reimbursed", "reimbursed"),
                       decrease = c(NA, 10.8, 12.5, NA, 15.25, NA, NA),
                       increase = c(50, NA, NA, 30, NA, 5.40, 7.25),
                       reimbursed_id = c(NA, "R", "R", NA, NA, 2, 3))
# # A tibble: 7 × 7
#      id   day   sent_to         received_from reference      decrease  increase reimbursed_id
#    <int>  <chr> <chr>           <chr>         <chr>          <dbl>     <dbl>    <chr>
# 1     1   day1  NA              ATM           add_cash       NA        50       NA
# 2     2   day2  Garden Cinema   NA            cinema_tickets 10.8      NA       R
# 3     3   day3  Pasta House     NA            meal           12.5      NA       R
# 4     4   day4  NA              Sarah         gift           NA        30       NA
# 5     5   day5  Blue Superstore NA            shopping       15.2      NA       NA
# 6     6   day6  Jane            NA            reimbursed     NA        5.4      2
# 7     7   day7  Joe             NA            reimbursed     NA        7.25     3

Explanation for reimbursed_id column:

  • R indicates the value in the decrease column is not representative of the user's actual spending because it includes the amount paid on someone's behalf
  • 2 (or any number) represents the id for which the user was reimbursed (returned the borrowed amount)

Desired outcome:

I would like to add an "actual_decrease" column to this dataset which essentially looks through the reimbursed_id column for numerical values, collects the reimbursed amount in the increase column and subtracts it from the values in the decrease for the respective id's.

I'm not sure how to better explain this, so here is what the desired result looks like:

# # A tibble: 7 × 9
#      id   day   sent_to         received_from reference      decrease  increase reimbursed_id  actual_decrease
#    <int>  <chr> <chr>           <chr>         <chr>          <dbl>     <dbl>    <chr>          <dbl>
# 1     1   day1  NA              ATM           add_cash       NA        50       NA             NA
# 2     2   day2  Garden Cinema   NA            cinema_tickets 10.8      NA       R              5.4
# 3     3   day3  Pasta House     NA            meal           12.5      NA       R              5.25
# 4     4   day4  NA              Sarah         gift           NA        30       NA             NA
# 5     5   day5  Blue Superstore NA            shopping       15.2      NA       NA             15.2
# 6     6   day6  Jane            NA            reimbursed     NA        5.4      2              NA
# 7     7   day7  Joe             NA            reimbursed     NA        7.25     3              NA

I can't think of any functions that could help solve this besides a really complicated ifelse() statement, and maybe a for loop (but I'd prefer to avoid loops).

Note that it is also preferred that your answer doesn't involve the row numbers, instead the id column, because for my actual dataset these values do not correspond.

Any input is appreciated :)


Solution

  • One option to approach this issue would be via a join. Basically I first create a helper df by filtering for the "reimbursed" transactions which could then be joined to the original df to finally compute the actual_decrease:

    library(dplyr, warn = FALSE)
    library(tidyr)
    
    transactions_help <- transactions |> 
      filter(reference == "reimbursed") |> 
      select(reimbursed_decrease = increase, reimbursed_id) |> 
      mutate(reimbursed_id = as.numeric(reimbursed_id))
    
    transactions <- transactions |> 
      left_join(transactions_help, by = c("id" = "reimbursed_id")) |> 
      replace_na(list(reimbursed_decrease = 0)) |> 
      mutate(actual_decrease = decrease - reimbursed_decrease)
    
    transactions
    #> # A tibble: 7 × 10
    #>      id day   sent_to    recei…¹ refer…² decre…³ incre…⁴ reimb…⁵ reimb…⁶ actua…⁷
    #>   <dbl> <chr> <chr>      <chr>   <chr>     <dbl>   <dbl> <chr>     <dbl>   <dbl>
    #> 1     1 day1  <NA>       ATM     add_ca…    NA     50    <NA>       0      NA   
    #> 2     2 day2  Garden Ci… <NA>    cinema…    10.8   NA    R          5.4     5.4 
    #> 3     3 day3  Pasta Hou… <NA>    meal       12.5   NA    R          7.25    5.25
    #> 4     4 day4  <NA>       Sarah   gift       NA     30    <NA>       0      NA   
    #> 5     5 day5  Blue Supe… <NA>    shoppi…    15.2   NA    <NA>       0      15.2 
    #> 6     6 day6  Jane       <NA>    reimbu…    NA      5.4  2          0      NA   
    #> 7     7 day7  Joe        <NA>    reimbu…    NA      7.25 3          0      NA   
    #> # … with abbreviated variable names ¹​received_from, ²​reference, ³​decrease,
    #> #   ⁴​increase, ⁵​reimbursed_id, ⁶​reimbursed_decrease, ⁷​actual_decrease