I want to process more than one bed files for finding overlapped regions. I read my data set as data frame, and how can I efficiently scanning two data set in parallel in order to detect where is the overlapped regions occurred. My approach is every time I am taking peak regions of the each cell of data frame objects as query, taking peak region of all row of another data frame in intervaltree, then searching overlapped regions. I am confused how to implement this in R. Please help about processing bed format files in bioinformatics. Appreciated if someone point me out how to do this ...
This is the simple example that what I want to achieve:
[1] chr1 [10171, 10226] * | MACS_peak_1 7.12
[2] chr1 [32698, 33079] * | MACS_peak_2 13.92
[3] chr1 [34757, 34794] * | MACS_peak_3 6.08
[4] chr1 [37786, 37833] * | MACS_peak_4 2.44
[5] chr1 [38449, 38484] * | MACS_peak_5 3.61
[6] chr1 [38584, 38838] * | MACS_peak_6 4.12
..
..
[] chrX [155191467, 155191508] * | MACS_peak_77948 3.80
[] chrX [155192786, 155192821] * | MACS_peak_77949 3.71
[] chrX [155206352, 155206433] * | MACS_peak_77950 3.81
[] chrX [155238796, 155238831] * | MACS_peak_77951 3.81
[n-1] chrX [155246563, 155246616] * | MACS_peak_77952 2.44
[n] chrX [155258442, 155258491] * | MACS_peak_77953 5.08
#step 1: read two bed files in R:
bed_1 <- as(import.bed(bedFile_1), "GRanges")
bed_2 <- as(import.bed(bedFile_2), "GRanges")
bed_3 <- as(import.bed(bedFile_3), "GRanges")
step 2: extract first row of the bed_1 (only take one specific interval as query). each row is considered as one specific genomic interval
peak <- bed_1[1] # only take one row once
bed_2.intvl <- GenomicRanges::GIntervalTree(bed_2)
#step 3: find overlapped regions:
overlap <- GenomicRanges::findOverlaps(peak, bed_2.intvl)
# step 4: go back to original bed_2 and look at which interval were overlapped with peak that comes from bed_1, what's the significance of each of these interval that comes from bed_2.
# step 5: then iterate next interval from bed_1 to repeat above process
Using Bioconductor, import the bed files using rtracklayer
library(rtracklayer)
bed1 = import("foo.bed")
bed2 = import("bar.bed")
Then query for 'overlaps'; it's not really clear what this means to you, maybe
bed1OverlappingBed2 = bed1[bed1 %over% bed2]
More flexibly, findOverlaps(bed1, bed2)
. Follow-up questions for this approach should be directed to the Bioconductor support forum.
Suppose we've input a query
and a subject
. Find all the hits
hits <- findOverlaps(query, subject)
This returns something that looks like a two-column matrix. The first column is an index into query, the second column is an index into subject. If the first element of query overlaps more than one element of subject, then there will be several rows where 1
appears several times under the query hits column, paired with the indexes of the subjects that this range overlaps. Take the original set of ranges and 'expand' them to match the hits, e.g.,
query[queryHits(hits)]
and find the intersection of the regions they overlap with
pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
This has given you the element-wise overlaps, but has done so without doing an iteration.
By way of a small example, here are some ranges on 'chr1' represented as GRanges
objects (the bed files are also represented as GRanges, but with additional mcols()
with information from the bed file).
query = GRanges("chr1", IRanges(c(10, 20, 30), width=5))
subject = GRanges("chr1", IRanges(c(10, 14), width=9))
They look like
> query
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [10, 14] *
[2] chr1 [20, 24] *
[3] chr1 [30, 34] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> subject
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [10, 18] *
[2] chr1 [14, 22] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Here are the hits:
> hits = findOverlaps(query, subject)
> hits
Hits object with 3 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
[2] 1 2
[3] 2 2
-------
queryLength: 3
subjectLength: 2
You can see that the first range of query overlaps with ranges 1 and 2 of subject. Here are the intersecting ranges
> pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | hit
<Rle> <IRanges> <Rle> | <logical>
[1] chr1 [10, 14] * | TRUE
[2] chr1 [14, 14] * | TRUE
[3] chr1 [20, 22] * | TRUE
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
So query 1 and subject 1 overlap from positions 10 to 14, query 1 and subject 2 overlap at position 14, and query 2 and subject 2 overlap at positions 20 to 22. (Bioconductor uses 1-based, closed intervals; UCSC uses 0-based half-open intervals; rtracklayer::import.bed()
does the right thing when importing files.