Search code examples
rdatemultiple-conditions

Select rows based on multiple conditions from two independent database


I have two independent two datasets, one contains event date. Each ID has only one "Eventdate". As follows:

data1 <- data.frame("ID" = c(1,2,3,4,5,6), "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01"))
data1
  ID  Eventdate
1  1 2019-01-01
2  2 2019-02-01
3  3 2019-03-01
4  4 2019-04-01
5  5 2019-05-01
6  6 2019-06-01

In another dataset, one ID have multiple event name (Eventcode) and its event date (Eventdate). As follows:

data2 <- data.frame("ID" = c(1,1,2,3,3,3,4,4,7), "Eventcode"=c(201,202,201,204,205,206,209,208,203),"Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15", "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20", "2019-06-02"))
data2
  ID Eventcode  Eventdate
1  1       201 2019-01-01
2  1       202 2019-01-01
3  2       201 2019-02-11
4  3       204 2019-02-15
5  3       205 2019-03-01
6  3       206 2019-03-15
7  4       209 2019-03-10
8  4       208 2019-03-20
9  7       203 2019-06-02

Two datasets were linked by ID. The ID of two datasets were not all the same. I would like to select cases in data2 with conditions:

  1. Match by ID
  2. Eventdate in data2 >= Eventdate in data1.
  3. If one ID has multiple Eventdates in data2, select the earliest one.
  4. If one ID has multiple Eventcodes at one Eventdate in data2, just randomly select one.

Then merge the selected data2 into data1.

Expected results as follows:

data1
  ID  Eventdate Eventdate.data2 Eventcode
1  1 2019-01-01      2019-01-01       201
2  2 2019-02-01      2019-02-11       201
3  3 2019-03-01      2019-03-01       205
4  4 2019-04-01                          
5  5 2019-05-01                          
6  6 2019-06-01 

or

data1
  ID  Eventdate Eventdate.data2 Eventcode
1  1 2019-01-01      2019-01-01       202
2  2 2019-02-01      2019-02-11       201
3  3 2019-03-01      2019-03-01       205
4  4 2019-04-01                          
5  5 2019-05-01                          
6  6 2019-06-01 

Thank you very very much!


Solution

  • You can try this approach :

    library(dplyr)
    
    left_join(data1, data2, by = 'ID') %>%
      group_by(ID, Eventdate.x) %>%
      summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1], 
                Eventcode = {
                  inds <- Eventdate.y >= Eventdate.x
                  val <- sum(inds, na.rm = TRUE)
                  if(val == 1) Eventcode[inds]
                  else if(val > 1) sample(Eventcode[inds], 1)
                  else NA_real_
                  })
    
    #    ID Eventdate.x Eventdate  Eventcode
    #  <dbl> <chr>       <chr>          <dbl>
    #1     1 2019-01-01  2019-01-01       201
    #2     2 2019-02-01  2019-02-11       201
    #3     3 2019-03-01  2019-03-01       205
    #4     4 2019-04-01  NA                NA
    #5     5 2019-05-01  NA                NA
    #6     6 2019-06-01  NA                NA
    

    The complicated logic in Eventcode data is for randomness, if you are ok selecting the 1st value like Eventdate you can simplify it to :

    left_join(data1, data2, by = 'ID') %>%
      group_by(ID, Eventdate.x) %>%
      summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1], 
                Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])