I have two independent two datasets, one contains event date. Each ID has only one "Eventdate". As follows:
data1 <- data.frame("ID" = c(1,2,3,4,5,6), "Eventdate" = c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01", "2019-05-01", "2019-06-01"))
data1
ID Eventdate
1 1 2019-01-01
2 2 2019-02-01
3 3 2019-03-01
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
In another dataset, one ID have multiple event name (Eventcode) and its event date (Eventdate). As follows:
data2 <- data.frame("ID" = c(1,1,2,3,3,3,4,4,7), "Eventcode"=c(201,202,201,204,205,206,209,208,203),"Eventdate" = c("2019-01-01", "2019-01-01", "2019-02-11", "2019-02-15", "2019-03-01", "2019-03-15", "2019-03-10", "2019-03-20", "2019-06-02"))
data2
ID Eventcode Eventdate
1 1 201 2019-01-01
2 1 202 2019-01-01
3 2 201 2019-02-11
4 3 204 2019-02-15
5 3 205 2019-03-01
6 3 206 2019-03-15
7 4 209 2019-03-10
8 4 208 2019-03-20
9 7 203 2019-06-02
Two datasets were linked by ID. The ID of two datasets were not all the same. I would like to select cases in data2 with conditions:
Then merge the selected data2 into data1.
Expected results as follows:
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 201
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
or
data1
ID Eventdate Eventdate.data2 Eventcode
1 1 2019-01-01 2019-01-01 202
2 2 2019-02-01 2019-02-11 201
3 3 2019-03-01 2019-03-01 205
4 4 2019-04-01
5 5 2019-05-01
6 6 2019-06-01
Thank you very very much!
You can try this approach :
library(dplyr)
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = {
inds <- Eventdate.y >= Eventdate.x
val <- sum(inds, na.rm = TRUE)
if(val == 1) Eventcode[inds]
else if(val > 1) sample(Eventcode[inds], 1)
else NA_real_
})
# ID Eventdate.x Eventdate Eventcode
# <dbl> <chr> <chr> <dbl>
#1 1 2019-01-01 2019-01-01 201
#2 2 2019-02-01 2019-02-11 201
#3 3 2019-03-01 2019-03-01 205
#4 4 2019-04-01 NA NA
#5 5 2019-05-01 NA NA
#6 6 2019-06-01 NA NA
The complicated logic in Eventcode
data is for randomness, if you are ok selecting the 1st value like Eventdate
you can simplify it to :
left_join(data1, data2, by = 'ID') %>%
group_by(ID, Eventdate.x) %>%
summarise(Eventdate = Eventdate.y[Eventdate.y >= Eventdate.x][1],
Eventcode = Eventcode[Eventdate.y >= Eventdate.x][1])