Search code examples
rbioinformaticsrollapply

Genome coverage as sliding window


I mapped my reads to my assembly using the bwa mem algorithm and extracted the number of reads per base (= coverage) using samtools depth. The resulting file is the following:

1091900001  1   236
1091900001  2   245
1091900001  3   265
1091900001  4   283
1091900001  5   288
1091900002  1   297
1091900002  2   312
1091900002  3   327
1091900002  4   338
1091900002  5   348

with three columns: name of the contig (since it is a multi-contig file, this ID changes) - position (base) - number of reads that mapped (coverage).

Now I want to calculate the coverage (third column) in sliding windows; in a window size of 3 and slide of 2 as the mean - per contig (first column).

I want to use the rollapply function of the zoo package.

require(zoo)
cov <- read.table("file",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape) #loads the library to rename the column names
cov<-rename(cov,c(V1="Chr", V2="locus", V3="depth")) #renames the header
rollapply(cov$depth, width = 3, by = 2, FUN = mean, align = "left")

But this of course does not take into account the contig. Plus, my expected output should include the contig-info & the window, it was calculated:

1091900001  1   3   248.6667
1091900001  3   5   278.6667
1091900002  1   3   312.0000
1091900002  3   5   337.6667

Is there an easy way for doing that in R?


Solution

  • Here's how you can do this with the dplyr functions group_by and do:

    library(dplyr)
    
    cov %>% 
      group_by(Chr) %>% 
      do(
        data.frame(
          window.start = rollapply(.$locus, width=3, by=2, FUN=min, align="left"),
          window.end = rollapply(.$locus, width=3, by=2, FUN=max, align="left"),
          coverage = rollapply(.$depth, width=3, by=2, FUN=mean, align="left")
          )
        )
    
    # # A tibble: 4 x 4
    # # Groups:   Chr [2]
    #          Chr window.start window.end coverage
    #        <int>        <int>      <int>    <dbl>
    # 1 1091900001            1          3 248.6667
    # 2 1091900001            3          5 278.6667
    # 3 1091900002            1          3 312.0000
    # 4 1091900002            3          5 337.6667
    

    do allows you to return an arbitrary number of values from grouped operations in the form of a data.frame. In this case, we return a the rolling mean of the coverage value, along with the min and max values from locus from each window.

    Edit:

    If your dataset is large, you may be better off performing the calculation using data.table. Its syntax is a bit harder to understand if you haven't seen it before, but it can offer substantial speed improvements in grouped operations on bigger data. Here's how your operation works with data.table:

    library(data.table)    
    
    setDT(cov)
    cov[, .(
          window.start = rollapply(locus, width=3, by=2, FUN=min, align="left"),
          window.end = rollapply(locus, width=3, by=2, FUN=max, align="left"),
          coverage = rollapply(depth, width=3, by=2, FUN=mean, align="left")
          ),
        .(Chr)]
    

    Based on the sample rows you provided, here are the benchmark results for the dplyr and data.table approaches (measured in milliseconds):

    # dplyr:
          min       lq     mean   median       uq      max neval
     7.811753 8.685976 10.10268 9.243551 10.42691 144.5274  1000
    
    # data.table:
          min       lq     mean  median       uq      max neval
     1.924472 2.105459 2.510832 2.30479 2.685706 8.848451  1000
    

    So on the sample data, the data.table option is about 4x faster on average.