I am wondering what would be the best way to define all ranges which are not covered by given set of ranges. For example, if I have a set of genes with known coordinates:
dtGenes <- fread(
"id,start,end
1,1000,1300
2,1200,1500
3,1600,2600
4,3000,4000
")
Let's say I know that total length of the chromosome (for simplicity, assume they are all on the same chromosome) is 10000. So, finally I expect to have the following list of intergenic regions:
"startR,endR
0,1000
1500,1600
2600,3000
4000,10000
"
can Bioconductor's IRange
be useful here? or there is some other good way to solve this?
Using the Bioconductor package GenomicRanges, transform your original data to a GRanges
library(GenomicRanges)
gr <- with(dtGenes, GRanges("chr1", IRanges(start, end, names=id),
seqlengths=c(chr1=10000)))
Then find the gaps between your genes
gaps <- gaps(gr)
GRanges
knows about strand. You didn't specify a strand in the GRanges
constructor, so strand was assigned *
. There are therefore 'gaps' on the +, -, and * strands, and you're only interested in those on the * strand
> gaps[strand(gaps) == "*"]
GRanges with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 1, 999] *
[2] chr1 [1501, 1599] *
[3] chr1 [2601, 2999] *
[4] chr1 [4001, 10000] *
---
seqlengths:
chr1
10000
Note the Bioconductor convention that chromosomes start at 1, and that the ranges are closed -- the start
and end
coordinates are included in the range. Use shift
and narrow
on gr
to make your ranges consistent with Bioconductor conventions. GRanges operations are efficient on 10's of millions of ranges.