Search code examples
roverlapoverlappingbioconductor

Concatenate individual genomic intervals into populational regions


I would like to concatenate individual genomic intervals into common regions.

My input:

dfin <- "chr start end sample type
        1   10    20   NE1    loss
        1   5     15   NE2    gain
        1   25    30   NE1    gain
        2   40    50   NE1    loss
        2   40    60   NE2    loss
        3   20    30   NE1    gain"
dfin <- read.table(text=dfin, header=T)

My expected output:

dfout <- "chr start end samples type
        1   5     20   NE1-NE2  both
        1   25    30   NE1      gain
        2   40    60   NE1-NE2  loss
        3   20    30   NE1      gain"
dfout <- read.table(text=dfout, header=T)

The intervals in dfin will never overlap in the same animal, just between animals (columns sample and samples, respectively). The column type have two factors (lossand gain) in dfin and is expected to have three factors in dfout (loss, gain and both, which occur when the concatenated region in dfout was based on both loss and gain).

Any idea to deal with that?

*Updated for @David Arenburg


Solution

  • (expanding on the comment) You could use the "IRanges" package:

    library(IRanges)
    
    #build an appropriate object
    dat = RangedData(space = dfin$chr, 
                     IRanges(dfin$start, dfin$end), 
                     sample = dfin$sample, 
                     type = dfin$type)
    dat
    #concatenate overlaps with an extra step of saving the concatenation mappings
    ans = RangedData(reduce(ranges(dat), with.revmap = TRUE))
    ans
    

    Could not figure out how to avoid reduce losing columns of "RangedData" object, but having the mappings saved we could do something like (there might be a more appropriate -according to "IRanges"- way to extract mappings, but I was not able to find it):

    tmp = elementMetadata(ranges(ans)@unlistData)$revmap@partitioning
    maps = rep(seq_along(start(tmp)), width(tmp))
    maps
    #[1] 1 1 2 3 3 4
    

    Having the mappings of interval concatenation, we could aggregate "sample" and "type" to get the final form. E.g.:

    tapply(dfin$sample, maps, function(X) paste(unique(X), collapse = "-"))
    #        1         2         3         4 
    #"NE1-NE2"     "NE1" "NE1-NE2"     "NE1"