Search code examples
rbioinformaticsbioconductoriranges

Break region into smaller regions based on cutoff


This is I assume a somewhat simple programming issue, but I've been struggling with it. Mostly because I don't know the right words to use, perhaps?

Given a set of "ranges" (in the form of 1-a set of numbers as below, 2-IRanges, or 3-GenomicRanges), I'd like to split it into a set of smaller ranges.

Example Beginning:

Chr    Start     End
1        1        10000
2        1        5000

Example size of breaks: 2000

New dataset:

Chr    Start    End
1        1       2000
1        2001    4000
1        4001    6000
1        6001    8000
1        8001    10000
2        1       2000
2        2001    4000
2        4001    5000

I'm doing this in R. I know I could generate these simply with seq, but I'd like to be able to do it based on a list/df of regions instead of having to manually do it every time I have a new list of regions.

Here's an example I've made using seq:

Given 22 chromosomes, loop through them and break each into pieces

# initialize df
Regions <- data.frame(Chromosome = c(), Start = c(), End = c())
# for each row, do the following
for(i in 1:nrow(Chromosomes)){
     # create a sequence from the minimum start to the max end by some value
     breks <- seq(min(Chromosomes$Start[Chromosomes$Chromosome == i]), max(Chromosomes$End[Chromosomes$Chromosome == i]), by=2000000)

     # put this into a dataframe
     database <- data.frame(Chromosome = i, Start = breks, End = c(breks[2:length(breks)]-1, max(Chromosomes$End[Chromosomes$Chromosome == i])))

     # bind with what we already have
     Regions <- rbind(Regions, database)
     rm(database)
}

This works fine, I'm wondering if there is something built into a package already to do this as a one-liner OR that is more flexible, as this has its limitations.


Solution

  • Using the R / Bioconductor package GenomicRanges, here are your initial ranges

    library(GenomicRanges)
    rngs = GRanges(1:2, IRanges(1, c(10000, 5000)))
    

    and then create a sliding window across the genome, generated first as a list (one set of tiles per chromosome) and then unlisted for the format you have in your question

    > windows = slidingWindows(rngs, width=2000, step=2000)
    > unlist(windows)
    GRanges object with 8 ranges and 0 metadata columns:
          seqnames        ranges strand
             <Rle>     <IRanges>  <Rle>
      [1]        1 [   1,  2000]      *
      [2]        1 [2001,  4000]      *
      [3]        1 [4001,  6000]      *
      [4]        1 [6001,  8000]      *
      [5]        1 [8001, 10000]      *
      [6]        2 [   1,  2000]      *
      [7]        2 [2001,  4000]      *
      [8]        2 [4001,  5000]      *
    
      -------
      seqinfo: 2 sequences from an unspecified genome; no seqlengths
    

    Coerce from / to a data.frame with as(df, "GRanges") or as(unlist(tiles), "data.frame").

    Find help at ?"slidingWindows,GenomicRanges-method" (tab completion is your friend, ?"slidingW<tab>).

    Embarrassingly, this seems to be implemented only in the 'devel' version of GenomicRanges (v. 1.25.93?); tile does something similar but rounds the width of ranges to be approximately equal while spanning the width of the GRanges. Here is a poor-man's version

    windows <- function(gr, width, withMcols=FALSE) {
        starts <- Map(seq, start(rngs), end(rngs), by=width)
        ends <- Map(function(starts, len) c(tail(starts, -1) - 1L, len),
                    starts, end(gr))
        seq <- rep(seqnames(gr), lengths(starts))
        strand <- rep(strand(gr), lengths(starts))
        result <- GRanges(seq, IRanges(unlist(starts), unlist(ends)), strand)
        seqinfo(result) <- seqinfo(gr)
        if (withMcols) {
            idx <- rep(seq_len(nrow(gr)), lengths(starts))
            mcols(result) = mcols(gr)[idx,,drop=FALSE]
        }
        result
    }
    

    invoked as

    > windows(rngs, 2000)
    

    If the approach is useful, consider asking follow-up questions on the Bioconductor support site.