Assign value of the first row above that meets the condition


Using dplyr, how do I get, for each act_id with status "Sent", the creation_date of the previous act_id whose status is "Automatic" or "Received"?

Within each reference group, every "Sent" row should pick up the creation_date of the first row above it (in the same status column) whose status is "Received" or "Automatic".

Data & Example

data = data.frame(
  creation_date = as.POSIXct(c("2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05")),
  end_date = as.POSIXct(c("2023-05-06", "2023-05-07", "2023-05-08", "2023-05-09", "2023-05-10")),
  status = c("Automatic", "Received", "Sent", "Sent", "Received"),
  reference = c("A", "A", "A", "B", "B"),
  act_id = c(1, 2, 3, 4, 5)
)
 

I tried creating a new column ("position") that holds the row_number wherever the status is "Automatic" or "Received", and then looking for the first row above that meets the condition to get its creation_date:

data %>%
  mutate(position = ifelse(status == "Automatic" | status == "Received", row_number(), NA_integer_)) %>%
  group_by(reference) %>%
  mutate(creation_date_2 = sapply(seq_along(status), function(i) {
    if (status[i] == "Sent") return(as.POSIXct(creation_date[max(which(position < i))]))
  }))

But I am not sure whether this is the most correct or efficient approach, or where I am making a mistake.

I have also tried:

data %>%
  group_by(reference) %>%
  mutate(creation_date_2 = if_else(
    status == "Sent",
    creation_date[max(position[status %in% c("Automatic", "Received")] > row_number(), na.rm = TRUE)],
    as.POSIXct(NA)))

In this case it runs, but it gives me the same creation_date for every act_id within the same reference.

Thank you in advance!


Solution

  • I was able to solve this with a helper function that uses a for loop. That isn't my favorite approach in R, but you might be able to come up with an apply-style alternative.

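    library(dplyr)

    # Walk upwards from row `id` and return the creation_date of the first
    # 'Automatic' or 'Received' row found. Assumes act_id equals the row number of `df`.
    look_back <- function(df, id) {
      if (id < 2) return(as.POSIXct(NA))
      for (i in (id - 1):1) {
        if (df$status[i] %in% c('Automatic', 'Received')) return(df$creation_date[i])
      }
      # if the loop finishes without finding any previous Automatic/Received entries, return NA
      return(as.POSIXct(NA))
    }

    # Grouping by act_id makes every row its own group, so act_id is a single value per call
    group_by(data, act_id) %>%
      mutate(sent_date = case_when(
        status == 'Sent' ~ look_back(data, act_id),
        TRUE ~ as.POSIXct(NA)))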
    

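    This gives the output below. Note that look_back() searches the whole data frame rather than only rows with the same reference, which is why act_id 4 (reference B) picks up the date of act_id 2 (reference A):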

    # A tibble: 5 x 6
    # Groups:   act_id [5]
      creation_date       end_date            status    reference act_id sent_date          
      <dttm>              <dttm>              <chr>     <chr>      <dbl> <dttm>             
    1 2023-05-01 00:00:00 2023-05-06 00:00:00 Automatic A              1 NA                 
    2 2023-05-02 00:00:00 2023-05-07 00:00:00 Received  A              2 NA                 
    3 2023-05-03 00:00:00 2023-05-08 00:00:00 Sent      A              3 2023-05-02 00:00:00
    4 2023-05-04 00:00:00 2023-05-09 00:00:00 Sent      B              4 2023-05-02 00:00:00
    5 2023-05-05 00:00:00 2023-05-10 00:00:00 Received  B              5 NA
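
As for a loop-free alternative that also respects the grouping by reference, here is a minimal sketch. It assumes the rows are already ordered within each reference; the helper name `last_source_date` and its `sources` argument are made up for illustration, and `creation_date_2` is the column name used in the question. It relies on `cummax()` over row indices to carry the position of the latest "Automatic"/"Received" row forward.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  creation_date = as.POSIXct(c("2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04", "2023-05-05")),
  end_date = as.POSIXct(c("2023-05-06", "2023-05-07", "2023-05-08", "2023-05-09", "2023-05-10")),
  status = c("Automatic", "Received", "Sent", "Sent", "Received"),
  reference = c("A", "A", "A", "B", "B"),
  act_id = c(1, 2, 3, 4, 5)
)

# Hypothetical helper: for every "Sent" row, return the creation_date of the
# closest row above it whose status is one of `sources`; NA everywhere else.
last_source_date <- function(creation_date, status, sources = c("Automatic", "Received")) {
  is_source <- status %in% sources
  # Running index of the most recent source row seen so far (0 = none yet)
  idx <- cummax(ifelse(is_source, seq_along(status), 0L))
  # Blank out rows with no earlier source row and rows that are not "Sent"
  idx[idx == 0L | status != "Sent"] <- NA
  creation_date[idx]
}

result <- data %>%
  group_by(reference) %>%   # the helper only sees rows of one reference at a time
  mutate(creation_date_2 = last_source_date(creation_date, status)) %>%
  ungroup()

print(result)
```

With the sample data, act_id 3 gets 2023-05-02 (from act_id 2 in reference A), while act_id 4 gets NA because no "Automatic" or "Received" row precedes it within reference B. If lookups across references are actually wanted, dropping the `group_by(reference)` step should reproduce the behaviour of the loop version.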