Search code examples
rmergedata.tablerbind

Joining two data.tables in r: removing overlap duplicates while keeping duplicates in each separate dataset


There are many questions and answers about joining two datasets, but I'm having trouble finding a solution to my current problem. I'm using data.table here.

I have two datasets, dtx and dty, with toy examples provided:

library(data.table)

dtx <- 
  structure(list(ID = c("a", "a", "a", "a", "c", "c", "c", "c", 
                        "d", "d", "d", "d", "d", "d", "d", "d", "e", "e", "f", "f", "f", 
                        "f", "f", "f", "f", "f", "g", "g", "g", "g", "g", "g", "g", "g"
  ), date = structure(c(939340800, 939340800, 949622400, 949622400, 
                        887414400, 887414400, 920332800, 920332800, 831686400, 831686400, 
                        831686400, 845078400, 845078400, 969062400, 969062400, 975369600, 
                        979689600, 979689600, 892598400, 892598400, 921801600, 921974400, 
                        968284800, 968284800, 968284800, 968284800, 927158400, 927158400, 
                        993081600, 993081600, 993081600, 993081600, 1057190400, 1057190400
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), code = c(1215, 
                                                              1220, 1320, 1320, 236, 300, 1245, 850, 1415, 1415, 1279, 230, 
                                                              230, 1115, 1215, 8749, 1212, 1212, 1112, 1112, 4561, 8145, 8145, 
                                                              1497, 1112, 1112, 258, 1112, 230, 240, 1112, 1445, 260, 1112), 
  dataset = c("x", "x", "x", "x", "x", "x", "x", "x", "x", 
              "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", 
              "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", 
              "x")), row.names = c(NA, -34L), class = c("data.table", "data.frame"
              ))

dty <-
  structure(list(ID = c("b", "b", "b", "b", "b", "b", "c", "c", 
                        "c", "c", "c", "d", "d", "d", "d", "d", "e", "e", "e", "e", "e", 
                        "e", "e", "e", "f", "f", "f", "f", "g", "g", "g", "g", "g"), 
                 date = structure(c(1055203200, 1055203200, 1055635200, 1058918400, 
                                    1058918400, 1074211200, 974764800, 974764800, 974764800, 
                                    979516800, 979516800, 975369600, 983491200, 983491200, 984528000, 
                                    987984000, 979689600, 979689600, 992217600, 992217600, 994896000, 
                                    995068800, 999043200, 999043200, 968284800, 968284800, 968284800, 
                                    968284800, 993081600, 993081600, 993081600, 993081600, 1057190400
                 ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), code = c(1112, 
                                                                             890, 8125, 2369, 1485, 1112, 1645, 1645, 794, 236, 4578, 
                                                                             8749, 230, 1114, 690, 720, 1212, 1212, 1112, 1112, 2060, 
                                                                             310, 1415, 310, 8145, 1497, 1112, 1112, 230, 240, 1112, 1445, 
                                                                             1112), dataset = c("y", "y", "y", "y", "y", "y", "y", "y", 
                                                                                                "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", 
                                                                                                "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", 
                                                                                                "y")), row.names = c(NA, -33L), class = c("data.table", "data.frame"
                                                                                                ))

In each dataset, there is an ID that relates to a specific individual, a date for when an event occurs, a code for the type of event, and a dummy variable to indicate the dataset for each row (this is to keep track of which dataset a row comes from after joining).

An individual can have multiple rows representing multiple events across multiple dates. Further, an individual can experience multiple events on the same date, including multiple events of the same type (i.e., the same event code can appear more than once for an individual on a specific date).

I want to join these two datasets, where there are unique and shared rows between them. In joining them, I want to remove duplicate entries: where a specific ID-date-code combination appears in both datasets, I only want to keep the version from dtx. Further, I want to keep a record of which dataset the row is originally from, which is why I have the dataset dummy column

Here is the desired output:

dtresult <- structure(list(ID = c("a", "a", "a", "a", "b", "b", "b", "b", 
                                  "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "d", "d", 
                                  "d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "e", "e", "e", 
                                  "e", "e", "e", "e", "e", "f", "f", "f", "f", "f", "f", "f", "f", 
                                  "g", "g", "g", "g", "g", "g", "g", "g"), date = structure(c(939340800, 
                                                                                              939340800, 949622400, 949622400, 1055203200, 1055203200, 1055635200, 
                                                                                              1058918400, 1058918400, 1074211200, 887414400, 887414400, 920332800, 
                                                                                              920332800, 974764800, 974764800, 974764800, 979516800, 979516800, 
                                                                                              831686400, 831686400, 831686400, 845078400, 845078400, 969062400, 
                                                                                              969062400, 975369600, 983491200, 983491200, 984528000, 987984000, 
                                                                                              979689600, 979689600, 992217600, 992217600, 994896000, 995068800, 
                                                                                              999043200, 999043200, 892598400, 892598400, 921801600, 921974400, 
                                                                                              968284800, 968284800, 968284800, 968284800, 927158400, 927158400, 
                                                                                              993081600, 993081600, 993081600, 993081600, 1057190400, 1057190400
                                  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), code = c(1215, 
                                                                                              1220, 1320, 1320, 1112, 890, 8125, 2369, 1485, 1112, 236, 300, 
                                                                                              1245, 850, 1645, 1645, 794, 236, 4578, 1415, 1415, 1279, 230, 
                                                                                              230, 1115, 1215, 8749, 230, 1114, 690, 720, 1212, 1212, 1112, 
                                                                                              1112, 2060, 310, 1415, 310, 1112, 1112, 4561, 8145, 8145, 1497, 
                                                                                              1112, 1112, 258, 1112, 230, 240, 1112, 1445, 260, 1112), dataset = c("x", 
                                                                                                                                                                   "x", "x", "x", "y", "y", "y", "y", "y", "y", "x", "x", "x", "x", 
                                                                                                                                                                   "y", "y", "y", "y", "y", "x", "x", "x", "x", "x", "x", "x", "x", 
                                                                                                                                                                   "y", "y", "y", "y", "x", "x", "y", "y", "y", "y", "y", "y", "x", 
                                                                                                                                                                   "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", 
                                                                                                                                                                   "x", "x")), row.names = c(NA, -55L), class = c("data.table", 
                                                                                                                                                                                                                  "data.frame"))

I am having trouble joining the two datasets in this scenario, because an individual can have multiple of the same event type on a single date. This means I can't simply use unique or duplicated to identify the doubled up rows.

In my first attempt at joining the datasets I thought I would use rbindlist:

l = list(dtx, dty)
dtxy = rbindlist(l, use.names = TRUE)
dtxy = dtxy[order(ID, date, code, dataset)]

But then I realized I couldn't remove the overlapping rows without also taking out instances where an individual has multiple of the same event on one date.

Then I tried using merge:

dtxy = merge(dtx,
             dty,
             all = TRUE,
             by = c("ID", "date", "code"))

But this method doesn't deal with instances where the same event type occurs more than once on the same date and is duplicated across the two datasets!

I'm really struggling with this one - I also thought to use funion, but this is essentially the same as merge.


Solution

  • You could remove the overlaps coming from y:

    l = list(dtx, dty)
    dtxy = rbindlist(l, use.names = TRUE)
    
    overlaps = merge(dtx,dty,by=c("ID","date","code"))[,.(ID,date,code,dataset = dataset.y)]
    
    dtresultnew <- overlaps[dtxy,.(ID,date,code,x.dataset,i.dataset),on = .(ID,date,code,dataset)][
             is.na(x.dataset),.(ID,date,code,dataset=i.dataset)]
    
    identical(dtresult[order(ID,date,code)],dtresultnew[order(ID,date,code)])
    [1] TRUE