
Merging rows with overlapping values


I have a data.table with a unique ID for each row, an age column, and two columns holding the confidence interval (CI) for the age. I want to merge rows whose CIs overlap, so that the returned CI runs from the minimum AgeMin to the maximum AgeMax of the overlapping rows.

   ID   Age AgeMax AgeMin
1:  2 48073  49213  46933
2:  3 49002  49638  48366
3:  7 44297  44706  43888

The returned result for this example would then be:

ID   Age AgeMax AgeMin
 2 48409  49638  46933
 7 44297  44706  43888

This is because IDs 2 and 3 have overlapping intervals: the AgeMax for ID == 2 is less than the AgeMax for ID == 3 but greater than the AgeMin for ID == 3. ID 7 doesn't overlap with the other rows, so it's returned as it was.

I don't mind which ID and Age are returned, as long as they come from one of the overlapping IDs.

My attempt is below, but I'm not getting the right results

library(data.table)
# sequence of years
step <- 10
window <- 30
startYear <- -60000+(0.5*window)
endYear <- 0-(0.5*window)
yrSeq <- abs(seq(startYear, endYear, step))

# Example DT
DT <- structure(list(ID = c(2L, 3L, 7L), Age = c(48073L, 49002L, 44297L
), AgeMax = c(49213L, 49638L, 44706L), AgeMin = c(46933L, 48366L, 
43888L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))

# split into a list to expand the CI's
s <- split(DT, DT$ID)

# Expand the CI's, to the nearest year in the seq
# merge back into a DT
d_seq <- rbindlist(lapply(s, function(x) {
      data.table(ID = x$ID, Yr = yrSeq[between(yrSeq, x$AgeMin, x$AgeMax)])}))

# remove duplicated years and return min and max years for each ID
d_seq <- d_seq[!duplicated(d_seq$Yr),]
d_seq <- d_seq[, .(AgeMin = min(Yr), AgeMax = max(Yr)), by = ID]

# merge with the original DT and select columns
DT <- merge(DT, d_seq, by = "ID")
DT <- DT[, c(1,2,5,6)]

Unfortunately, this isn't working: ID == 3 is still returned even though it overlaps with ID == 2 (as shown above), and the AgeMin and AgeMax for ID == 3 no longer cover the Age for that ID!

   ID   Age AgeMin AgeMax
1:  2 48073  46935  49205
2:  3 49002  49215  49635
3:  7 44297  43895  44705

I'm sure I'm overthinking this and there must be a simple way to get what I need, but I haven't been able to find a solution.

I've already tried to modify the examples here and here.

Here is an extra example data.table to test on.

testDT <- structure(list(ID = c(54L, 57L, 58L, 60L, 61L, 62L, 64L, 180L
), Age = c(14219L, 13989L, 13883L, 13482L, 13403L, 13383L, 13340L, 
13994L), AgeMax = c(14343L, 14087L, 13972L, 13540L, 13465L, 13442L, 
13407L, 14083L), AgeMin = c(14095L, 13891L, 13794L, 13424L, 13341L, 
13324L, 13273L, 13905L)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

Solution

  • Here is a data.table solution

    library(data.table)
    setDT(testDT)
    
    testDT[order(AgeMin)
          ][, .(AgeMin=min(AgeMin), AgeMax=max(AgeMax)),
           by=.(group=cumsum(c(1, tail(AgeMin, -1) > head(AgeMax, -1))))]
    #>    group AgeMin AgeMax
    #> 1:     1  13273  13540
    #> 2:     2  13794  14087
    #> 3:     3  14095  14343
    

    The key to this solution is identifying the groups of overlapping periods.

    Say we have two ranges, p1 and p2, with start and end points start1, end1 and start2, end2.

    There are only two conditions under which p1 and p2 do not overlap:

    1. start1 > end2
      OR
    2. end1 < start2
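
The two conditions can be checked directly. A minimal sketch, assuming closed ranges; `overlaps()` is a hypothetical helper written for illustration, not a data.table function:

```r
# Hypothetical helper: TRUE when the closed ranges [start1, end1] and
# [start2, end2] overlap, i.e. when neither non-overlap condition holds.
overlaps <- function(start1, end1, start2, end2) {
  !(start1 > end2 | end1 < start2)
}

# Ranges taken from the question's example (AgeMin, AgeMax).
overlaps(46933, 49213, 48366, 49638)  # ID 2 vs ID 3: TRUE
overlaps(46933, 49213, 43888, 44706)  # ID 2 vs ID 7: FALSE
```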

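    Since the rows are already ordered by AgeMin ascending, we only need to check condition 1, taking p1 as the current row and p2 as the row before it (condition 2 can never hold in that order). cumsum over these checks then gives the group identifier.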
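
The grouping above compares each row only with the row directly before it. That is enough for the test data, but if one range fully contains a later, shorter one, comparing against the running maximum of AgeMax (base R `cummax`) may be safer. The sketch below is one possible variant, not the answer's original code. It also keeps the ID and Age of the first row in each group, since the question says any ID and Age from the overlapping rows is acceptable. `merge_overlaps` is a hypothetical name.

```r
library(data.table)

# Hypothetical helper: merge rows whose [AgeMin, AgeMax] ranges overlap.
merge_overlaps <- function(dt) {
  x <- dt[order(AgeMin)]
  # Largest AgeMax seen among all earlier rows (NA for the first row),
  # so a range nested inside an earlier one stays in the same group.
  prev_max <- shift(cummax(x$AgeMax))
  # Start a new group at the first row and wherever a range begins
  # after every earlier range has ended.
  x[, group := cumsum(is.na(prev_max) | AgeMin > prev_max)]
  x[, .(ID = ID[1], Age = Age[1], AgeMax = max(AgeMax), AgeMin = min(AgeMin)),
    by = group][, group := NULL][]
}

# The three-row example from the question.
DT <- data.table(ID     = c(2L, 3L, 7L),
                 Age    = c(48073L, 49002L, 44297L),
                 AgeMax = c(49213L, 49638L, 44706L),
                 AgeMin = c(46933L, 48366L, 43888L))
merge_overlaps(DT)
```

For the question's example this returns two rows: ID 7 unchanged, and ID 2 with AgeMin 46933 and AgeMax 49638, covering the merged ranges of IDs 2 and 3.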