Search code examples
rdataframerangebioinformaticsoverlapping

Find overlapping ranges in a dataframe and assign them values


A simpler version of the original question which I asked but nobody answered it yet.

I have a huge input file (a representative sample of which is shown below as input):

> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400

I want to process it by following a rule (described below) so that I get an output like:

 > output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   1
chr2:300-700    1   1   0
chr1:250-450    1   1   1
chr2:200-500    1   1   0
chr2:600-1000   1   1   1
chr1:400-800    0   1   1
chr1:700-870    1   0   1
chr2:700-1400   0   1   1

Rule: Take every index (the first in this case is chr1:200-400) of the dataframe, see if it overlaps with any other value in the dataframe. If yes, write 1 below that column in which it exists, if not write 0.

For example, if we take 1st index of the input input[1,1] which is chr1:200-400. As it exists in column 1 we will write 1 below it. Now we will check if this range overlap with any other range which exists in any of the other columns in the input. This value overlaps only with the first value (chr1:250-450) of the second column (CT2), therefore, we write 1 below that as well. As there is no overlap with any of the values in CT3, we write 0 below CT3 in the output dataframe.

Here are the dput of input and output:

> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800", 
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500", 
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))

Solution

  • A possible solution using the data.table-package:

    # load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
    library(data.table)
    setDT(input)
    
    # reshape 'input' to long format and split the strings in 3 columns
    DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
                                          , by = variable][]
    
    # create aggregation function; needed in the ast reshape step
    f <- function(x) as.integer(length(x) > 0)
    
    # cartesian self join & reshape result back to wide format with aggregation function
    DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
       ][, dcast(.SD, value ~ i.variable, fun = f)]
    

    which gives:

               value CT1 CT2 CT3
    1:  chr1:200-400   1   1   0
    2:  chr1:250-450   1   1   1
    3:  chr1:400-800   0   1   1
    4:  chr1:700-870   1   0   1
    5:  chr1:800-970   1   0   1
    6:  chr2:200-500   1   1   0
    7:  chr2:300-700   1   1   0
    8: chr2:600-1000   1   1   1
    9: chr2:700-1400   0   1   1