Search code examples
roverlapoverlappinggenome

Common genomic intervals in R


I would like to infer shared genomic interval between different samples.

My input:

sample    chr start end
NE001      1   100  200
NE001      2   100  200
NE002      1   50   150
NE002      2   50   150
NE003      2   250  300

My expected output:

chr start end  freq
1    100  150   2
2    100  150   2

Where the "freq" is the how many samples have contribuited to infer the shared region. In the above example freq = 2 (NE001 and NE002).

Cheers!


Solution

  • If your data is in a data.frame (see below), using the Bioconductor GenomicRanges package I create a GRanges instance, keeping the non-range columns too

    library(GenomicRanges)
    gr <- makeGRangesFromDataFrame(df, TRUE)
    

    The discrete ranges represented by the data are given by the disjoin function, and the overlap between the disjoint ranges ('query') and your original ('subject') are

    d <- disjoin(gr)
    olaps <- findOverlaps(d, gr)
    

    Split the sample information associated with each overlapping subject with the corresponding query, and associate it with the disjoint GRanges as

    mcols(d) <- splitAsList(gr$sample[subjectHits(olaps)], queryHits(olaps))
    

    leading to for instance

    > d[elementLengths(d$value) > 1]
    GRanges with 2 ranges and 1 metadata column:
          seqnames     ranges strand |           value
             <Rle>  <IRanges>  <Rle> | <CharacterList>
      [1]        1 [100, 150]      * |     NE001,NE002
      [2]        2 [100, 150]      * |     NE001,NE002
      ---
      seqlengths:
        1  2
       NA NA
    

    Here's how I input your data:

    txt <- "sample    chr start end
    NE001      1   100  200
    NE001      2   100  200
    NE002      1   50   150
    NE002      2   50   150
    NE003      2   250  300"
    df <- read.table(textConnection(txt), header=TRUE, stringsAsFactors=FALSE)