I would like to infer shared genomic interval between different samples.
My input:
sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300
My expected output:
chr start end freq
1 100 150 2
2 100 150 2
Where the "freq" is the how many samples have contribuited to infer the shared region. In the above example freq = 2 (NE001 and NE002).
Cheers!
If your data is in a data.frame (see below), using the Bioconductor GenomicRanges package I create a GRanges instance, keeping the non-range columns too
library(GenomicRanges)
gr <- makeGRangesFromDataFrame(df, TRUE)
The discrete ranges represented by the data are given by the disjoin
function, and the overlap between the disjoint ranges ('query') and your original ('subject') are
d <- disjoin(gr)
olaps <- findOverlaps(d, gr)
Split the sample information associated with each overlapping subject with the corresponding query, and associate it with the disjoint GRanges as
mcols(d) <- splitAsList(gr$sample[subjectHits(olaps)], queryHits(olaps))
leading to for instance
> d[elementLengths(d$value) > 1]
GRanges with 2 ranges and 1 metadata column:
seqnames ranges strand | value
<Rle> <IRanges> <Rle> | <CharacterList>
[1] 1 [100, 150] * | NE001,NE002
[2] 2 [100, 150] * | NE001,NE002
---
seqlengths:
1 2
NA NA
Here's how I input your data:
txt <- "sample chr start end
NE001 1 100 200
NE001 2 100 200
NE002 1 50 150
NE002 2 50 150
NE003 2 250 300"
df <- read.table(textConnection(txt), header=TRUE, stringsAsFactors=FALSE)