Search code examples
rdplyrgroup-bydata.tableiranges

grouping overlapping regions based on a clustering factor in R


Using the foverlaps function from the data.table package I get overlapping regions (it shows only 25 lines but it's more than 50 thousand) and I would like to group the overlapping regions for each id taking into account the following criteria: If they have the same ID and overlapping regions belonging to the same or different group, then:

  1. group them all, 2) extend the range (i.e. start = min(overlapping item set) and end=max(overlapping item set)), and 3) place the name of the group of the maximum score. For example, given the data set:
dt <- data.table::data.table(
ID=c("1015_4_1_1","1015_4_1_1","1015_4_1_1","103335_0_1_2","103335_0_1_2",
"103335_0_1_2","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1","11099_0_1_1", 
"11702_0_1_1","11702_0_1_1","11702_0_1_1","11702_0_1_5","11702_0_1_5","11702_0_1_5",
"140331_0_1_1","140331_0_1_1","140331_0_1_1","14115_0_1_7","14115_0_1_7", 
"14115_0_1_7","14115_0_1_8","14115_0_1_8"),
start=c(193,219,269,149,149,163,51,85,314,331,410,6193,6269,6278,6161,6238,6246,303,304,316,1525,1526,1546,1542,1543),
end=c(307,273,399,222,235,230,158,128,401,428,507,6355,6337,6356,6323,6305,6324,432,396,406,1603,1688,1612,1620,1705),
group=c("R7","R5","R5","R4","R5","R6","R7","R5","R4","R5","R5","R5","R6","R4","R5","R6","R4","R5","R4","R6","R4","R5","R6","R4","R5"),
score=c(394,291,409,296,319,271,318,252,292,329,252,524,326,360,464,340,335,515,506,386,332,501,307,308,443)
)

The expected result is:

#  1015_4_1_1   193  399    R5   409
#  103335_0_1_2   149  235    R5   319
#  11099_0_1_1    51  158    R7   318
#  11099_0_1_1   314  507    R5   329
#  11702_0_1_1  6193 6356    R5   524
#  11702_0_1_5  6161 6324    R5   464
#  140331_0_1_1   303  432    R5   515
#  14115_0_1_7  1525 1705    R5   501

note that for each ID there may be subgroups of regions that do not overlap each other, for example in "11099_0_1_1" rows 7 and 8 are grouped in one subgroup and the rest in another subgroup.

I have no experience with GenomicRanges or IRanges, and read in another comment that data.table is usually faster. So, since I was expecting a lot of overlapping regions, I started with foverlaps from data.table, but I don't know how to proceed. I hope you can help me, and thank you very much in advance


Solution

  • If your group is the full ID, then you could do:

    dt <- dt[
      ,IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = ID][
        , .(start = min(start), end = max(end),
            group = group[which.max(score)], 
            score = max(score)),
        by = .(ID, IDy)][, IDy := NULL]
    

    Output (additional 443 score added as it represents the 14115_0_1_8):

                 ID start  end group score
    1:   1015_4_1_1   193  399    R5   409
    2: 103335_0_1_2   149  235    R5   319
    3:  11099_0_1_1    51  158    R7   318
    4:  11099_0_1_1   314  507    R5   329
    5:  11702_0_1_1  6193 6356    R5   524
    6:  11702_0_1_5  6161 6324    R5   464
    7: 140331_0_1_1   303  432    R5   515
    8:  14115_0_1_7  1525 1688    R5   501
    9:  14115_0_1_8  1541 1705    R5   443
    

    In case your ID group are actually only the numbers before the underscore, then:

    library(data.table)
    
    dt <- dt[, IDx := sub('_.*', '', ID)][
      , IDy := cumsum(fcoalesce(+(start > (shift(cummax(end), type = 'lag') + 1L)), 0L)), by = IDx][
        , .(ID = ID[which.max(score)],
            start = min(start), end = max(end),
            group = group[which.max(score)], 
            score = max(score)),
        by = .(IDx, IDy) 
        ][, c('IDx', 'IDy') := NULL]
    

    Output (lacks 464 from your example):

    dt
    
                 ID start  end group score
    1:   1015_4_1_1   193  399    R5   409
    2: 103335_0_1_2   149  235    R5   319
    3:  11099_0_1_1    51  158    R7   318
    4:  11099_0_1_1   314  507    R5   329
    5:  11702_0_1_1  6161 6356    R5   524
    6: 140331_0_1_1   303  432    R5   515
    7:  14115_0_1_7  1525 1705    R5   501
    

    The above assumes that your start variable is already ordered from lowest to highest. If this is not the case, just do the setorder(dt, start) before executing the above code.