Search code examples
rtidyr

Keep 1 row of a group if any rows of the group meet a condition


As part of a larger data-set, I have data along the lines of the following:

data <- structure(
  list(
    ID = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4), 
    Event = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0), 
    Date = c("12-24-2019", "12-25-2019",  "12-26-2019", "12-27-2019", "12-28-2019", "12-29-2019", "12-30-2019",  "12-31-2019", "12-01-2019", "12-02-2019", "12-03-2019", "12-04-2019"
 )
 ), class = "data.frame", row.names = c(NA, -12L))

Resulting in the following:

data-table

Essentially, I am wanting to end up with one row per ID, but which row is kept depends on whether the event occurred for an ID or not.

  • If the event occurs for an ID, I want to keep the date that the event occurred and get rid of the other rows for that ID

  • If the event does not occur at any point for an ID I want to keep the latest date and get rid of the other rows for that ID

Desired output:

data table output

I'm quite new to R and am not sure how to start with this - my googling hasn't helped!

Many thanks for any help


Solution

  • Here is adplyr solution using summarise:

    library(dplyr)
    
    
    data %>%
      group_by(ID) %>%
      summarise(
        Date = c(Date[Event == 1], last(Date))[1],
        Event = max(Event),
      )
    
    # A tibble: 4 × 3
         ID Date       Event
      <dbl> <chr>      <dbl>
    1     1 12-24-2019     1
    2     2 12-28-2019     0
    3     3 12-31-2019     0
    4     4 12-02-2019     1
    

    This works for the provided data and for all datasets, where there is at most one event per ID. Otherwise the code would have to be slightly modified.

    Edit

    If it is possible to have more Events per ID then we could use tail to get the Date of the last Event.

    data %>%
      group_by(ID) %>%
      summarise(
        Date = c(tail(Date[Event == 1], 1), last(Date))[1],
        Event = max(Event),
      )
    

    The idea is very simple: inside summarise I create a c(tail(Date[Event == 1], 1), last(Date))

    • The first element tail(Date[Event == 1], 1) is equal to the last entry (tail(..., 1)) of Date for which Event == 1. If no such entry exists this expression evaluates to a vector of length zero, making the overall vector length 1.

    • The second element is simply the last Date.

    Finally, I use [1] at the end to extract the first element of the vector described above.