Search code examples
rbioconductoriranges

Find all ranges outside of defined set of ranges


I am wondering what would be the best way to define all ranges which are not covered by given set of ranges. For example, if I have a set of genes with known coordinates:

dtGenes <- fread(
  "id,start,end
 1,1000,1300
 2,1200,1500
 3,1600,2600
 4,3000,4000
")

Let's say I know that total length of the chromosome (for simplicity, assume they are all on the same chromosome) is 10000. So, finally I expect to have the following list of intergenic regions:

"startR,endR
    0,1000
 1500,1600
 2600,3000
 4000,10000
"

can Bioconductor's IRange be useful here? or there is some other good way to solve this?


Solution

  • Using the Bioconductor package GenomicRanges, transform your original data to a GRanges

    library(GenomicRanges)
    gr <- with(dtGenes, GRanges("chr1", IRanges(start, end, names=id),
                                seqlengths=c(chr1=10000)))
    

    Then find the gaps between your genes

    gaps <- gaps(gr)
    

    GRanges knows about strand. You didn't specify a strand in the GRanges constructor, so strand was assigned *. There are therefore 'gaps' on the +, -, and * strands, and you're only interested in those on the * strand

    > gaps[strand(gaps) == "*"]
    GRanges with 4 ranges and 0 metadata columns:
          seqnames        ranges strand
             <Rle>     <IRanges>  <Rle>
      [1]     chr1 [   1,   999]      *
      [2]     chr1 [1501,  1599]      *
      [3]     chr1 [2601,  2999]      *
      [4]     chr1 [4001, 10000]      *
      ---
      seqlengths:
        chr1
       10000
    

    Note the Bioconductor convention that chromosomes start at 1, and that the ranges are closed -- the start and end coordinates are included in the range. Use shift and narrow on gr to make your ranges consistent with Bioconductor conventions. GRanges operations are efficient on 10's of millions of ranges.