Genome coverage as sliding window

I mapped my reads to my assembly using the bwa mem algorithm and extracted the number of reads per base (= coverage) using samtools depth. The resulting file is the following:

1091900001  1   236
1091900001  2   245
1091900001  3   265
1091900001  4   283
1091900001  5   288
1091900002  1   297
1091900002  2   312
1091900002  3   327
1091900002  4   338
1091900002  5   348

with three columns: name of the contig (since it is a multi-contig file, this ID changes) - position (base) - number of reads that mapped (coverage).

Now I want to calculate the coverage (third column) in sliding windows; in a window size of 3 and slide of 2 as the mean - per contig (first column).

I want to use the rollapply function of the zoo package.

require(zoo)
cov <- read.table("file",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape) #loads the library to rename the column names
cov<-rename(cov,c(V1="Chr", V2="locus", V3="depth")) #renames the header
rollapply(cov$depth, width = 3, by = 2, FUN = mean, align = "left")

But this of course does not take into account the contig. Plus, my expected output should include the contig-info & the window, it was calculated:

1091900001  1   3   248.6667
1091900001  3   5   278.6667
1091900002  1   3   312.0000
1091900002  3   5   337.6667

Is there an easy way for doing that in R?

Solution

Here's how you can do this with the dplyr functions group_by and do:

library(dplyr)

cov %>% 
  group_by(Chr) %>% 
  do(
    data.frame(
      window.start = rollapply(.$locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(.$locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(.$depth, width=3, by=2, FUN=mean, align="left")
      )
    )

# # A tibble: 4 x 4
# # Groups:   Chr [2]
#          Chr window.start window.end coverage
#        <int>        <int>      <int>    <dbl>
# 1 1091900001            1          3 248.6667
# 2 1091900001            3          5 278.6667
# 3 1091900002            1          3 312.0000
# 4 1091900002            3          5 337.6667

do allows you to return an arbitrary number of values from grouped operations in the form of a data.frame. In this case, we return a the rolling mean of the coverage value, along with the min and max values from locus from each window.

Edit:

If your dataset is large, you may be better off performing the calculation using data.table. Its syntax is a bit harder to understand if you haven't seen it before, but it can offer substantial speed improvements in grouped operations on bigger data. Here's how your operation works with data.table:

library(data.table)    

setDT(cov)
cov[, .(
      window.start = rollapply(locus, width=3, by=2, FUN=min, align="left"),
      window.end = rollapply(locus, width=3, by=2, FUN=max, align="left"),
      coverage = rollapply(depth, width=3, by=2, FUN=mean, align="left")
      ),
    .(Chr)]

Based on the sample rows you provided, here are the benchmark results for the dplyr and data.table approaches (measured in milliseconds):

# dplyr:
      min       lq     mean   median       uq      max neval
 7.811753 8.685976 10.10268 9.243551 10.42691 144.5274  1000

# data.table:
      min       lq     mean  median       uq      max neval
 1.924472 2.105459 2.510832 2.30479 2.685706 8.848451  1000

So on the sample data, the data.table option is about 4x faster on average.