Search code examples
rparallel-processingbioinformaticsbed

How to process two bed files for finding overlapped regions in parallel?


I want to process more than one bed files for finding overlapped regions. I read my data set as data frame, and how can I efficiently scanning two data set in parallel in order to detect where is the overlapped regions occurred. My approach is every time I am taking peak regions of the each cell of data frame objects as query, taking peak region of all row of another data frame in intervaltree, then searching overlapped regions. I am confused how to implement this in R. Please help about processing bed format files in bioinformatics. Appreciated if someone point me out how to do this ...

This is the simple example that what I want to achieve:

  [1]     chr1 [10171, 10226]      * | MACS_peak_1      7.12
  [2]     chr1 [32698, 33079]      * | MACS_peak_2     13.92
  [3]     chr1 [34757, 34794]      * | MACS_peak_3      6.08
  [4]     chr1 [37786, 37833]      * | MACS_peak_4      2.44
  [5]     chr1 [38449, 38484]      * | MACS_peak_5      3.61
  [6]     chr1 [38584, 38838]      * | MACS_peak_6      4.12
  ..
  ..
  []     chrX [155191467, 155191508]      * | MACS_peak_77948      3.80
  []     chrX [155192786, 155192821]      * | MACS_peak_77949      3.71
  []     chrX [155206352, 155206433]      * | MACS_peak_77950      3.81
  []     chrX [155238796, 155238831]      * | MACS_peak_77951      3.81
  [n-1]     chrX [155246563, 155246616]      * | MACS_peak_77952      2.44
  [n]     chrX [155258442, 155258491]      * | MACS_peak_77953      5.08



  #step 1: read two bed files in R:

    bed_1 <- as(import.bed(bedFile_1), "GRanges")
    bed_2 <- as(import.bed(bedFile_2), "GRanges")
    bed_3 <- as(import.bed(bedFile_3), "GRanges")

  step 2: extract first row of the bed_1 (only take one specific interval as query). each row is considered as one specific genomic interval

    peak <- bed_1[1]      # only take one row once
    bed_2.intvl <- GenomicRanges::GIntervalTree(bed_2)

  #step 3: find overlapped regions:

    overlap <- GenomicRanges::findOverlaps(peak, bed_2.intvl)
  # step 4: go back to original bed_2 and look at which interval were overlapped with peak that comes from bed_1, what's the significance of each of these interval that comes from bed_2.

  # step 5: then iterate next interval from bed_1 to repeat above process

Solution

  • Using Bioconductor, import the bed files using rtracklayer

    library(rtracklayer)
    bed1 = import("foo.bed")
    bed2 = import("bar.bed")
    

    Then query for 'overlaps'; it's not really clear what this means to you, maybe

    bed1OverlappingBed2 = bed1[bed1 %over% bed2]
    

    More flexibly, findOverlaps(bed1, bed2). Follow-up questions for this approach should be directed to the Bioconductor support forum.

    Suppose we've input a query and a subject. Find all the hits

    hits <- findOverlaps(query, subject)
    

    This returns something that looks like a two-column matrix. The first column is an index into query, the second column is an index into subject. If the first element of query overlaps more than one element of subject, then there will be several rows where 1 appears several times under the query hits column, paired with the indexes of the subjects that this range overlaps. Take the original set of ranges and 'expand' them to match the hits, e.g.,

    query[queryHits(hits)]
    

    and find the intersection of the regions they overlap with

    pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
    

    This has given you the element-wise overlaps, but has done so without doing an iteration.

    By way of a small example, here are some ranges on 'chr1' represented as GRanges objects (the bed files are also represented as GRanges, but with additional mcols() with information from the bed file).

    query = GRanges("chr1", IRanges(c(10, 20, 30), width=5))
    subject = GRanges("chr1", IRanges(c(10, 14), width=9))
    

    They look like

    > query
    GRanges object with 3 ranges and 0 metadata columns:
          seqnames    ranges strand
             <Rle> <IRanges>  <Rle>
      [1]     chr1  [10, 14]      *
      [2]     chr1  [20, 24]      *
      [3]     chr1  [30, 34]      *
      -------
      seqinfo: 1 sequence from an unspecified genome; no seqlengths
    > subject
    GRanges object with 2 ranges and 0 metadata columns:
          seqnames    ranges strand
             <Rle> <IRanges>  <Rle>
      [1]     chr1  [10, 18]      *
      [2]     chr1  [14, 22]      *
      -------
      seqinfo: 1 sequence from an unspecified genome; no seqlengths
    

    Here are the hits:

    > hits = findOverlaps(query, subject)
    > hits
    Hits object with 3 hits and 0 metadata columns:
          queryHits subjectHits
          <integer>   <integer>
      [1]         1           1
      [2]         1           2
      [3]         2           2
      -------
      queryLength: 3
      subjectLength: 2
    

    You can see that the first range of query overlaps with ranges 1 and 2 of subject. Here are the intersecting ranges

    > pintersect(query[queryHits(hits)], subject[subjectHits(hits)])
    GRanges object with 3 ranges and 1 metadata column:
          seqnames    ranges strand |       hit
             <Rle> <IRanges>  <Rle> | <logical>
      [1]     chr1  [10, 14]      * |      TRUE
      [2]     chr1  [14, 14]      * |      TRUE
      [3]     chr1  [20, 22]      * |      TRUE
      -------
      seqinfo: 1 sequence from an unspecified genome; no seqlengths
    

    So query 1 and subject 1 overlap from positions 10 to 14, query 1 and subject 2 overlap at position 14, and query 2 and subject 2 overlap at positions 20 to 22. (Bioconductor uses 1-based, closed intervals; UCSC uses 0-based half-open intervals; rtracklayer::import.bed() does the right thing when importing files.