Search code examples
rdplyraggregategenomicranges

aggregate bins in large GRanges efficiently


I have a SummarizedExperiment (but we can consider it a GRanges). What I want is to reduce the number of intervals, keeping only one row for every identical adjacent mcol(gr), important is to also keep track of the new extend interval.

Thanks a lot!

gr <- GRanges(
seqnames = Rle(c("chr1"), c(12)),
ranges = IRanges(1:12*10, end = 1:12*10+5),
state1 = c(1,1,1,1,2,3,4,5,5,5,1,1),
state2 = c(1,1,1,2,2,2,5,5,6,6,1,1))

The resulting GRanges should look like this:

gr2 <- GRanges(
  seqnames = Rle(c("chr1"), c(8)),
  ranges = IRanges(start = c(10,40,50,60,70,80,90,110),
  end =  c(35,45,55,65,75,85,105,125)),
  state1 = c(1,1,2,3,4,5,5,1), state2 = c(1,2,2,2,5,5,6,1))​  

Edit: I have edit the Granges so that a state pair is present also in non adjacent intervals (this second 1,1 pair has to be report independely from the first) Sorry my initial solution was also wrong!

Thanks a lot!


Solution

  • Create an artificial factor, make sure the levels are in the order of occurrence in the factor (rather than the default alphabetical) to avoid re-arranging the GRanges, and split the GRanges object

    f0 = paste(gr$state1, gr$state2, sep=".")
    f = factor(f0, levels=unique(f0))
    grl = split(gr, f)
    

    Get the ranges and relevant metadata

    grf = unlist(range(grl), use.names=FALSE)
    mcols(grf) = mcols(gr)[!duplicated(f),]
    

    split(), range(), and unlist() should all be 'fast' for data of this size.

    To also split on chromosome, add that to the factor

    f0 = paste(seqnames(gr), gr$state1, gr$state2, sep=".")
    

    To split in some other way, e.g., only when states are adjacent, figure out a way to make the appropriate factor, e.g.,

    f0 = paste(
        seqnames(gr),
        cumsum(c(TRUE, diff(gr$state1) != 0)),
        cumsum(c(TRUE, diff(gr$state2) != 0)),
        sep=".")
    

    Ask questions about Bioconductor packages on the Bioconductor support site.