Search code examples
rdna-sequencesliding-window

Finding Specific Vector Entries in a Sliding Window


I am trying to create a function that will return counts of specific adjacent nucleotides (CG beside eachother) within a specific window that I have formatted in a vector.

I would like the windows to be 100 nucleotides long and move shift every 10.

The data is setup like this (to 10k entries):

data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")

So far I have tried this:

library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))

But I always get the error

"Error in match.fun(FUN) : 
'count(data, 2)' is not a function, character or symbol"

I have also tried:

starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
    chunk <- data[starts[i]:(starts[i]+99)]
    chunkCG <- count(chunk,wordsize=2)
    print (chunkCG)
}

However, I do not know how to save the data that is returned. This approach also does not allow me to overlap frames.


Solution

  • EDIT: To get the desired output with a 10 observation sliding window you can use a for loop. Since we pre-allocate the size of our result vector, the loop is reasonably fast. I think this is the best way to solve your problem since I dont think a lot of grouping (if any) supports a sliding window:

    library(data.table)
    set.seed(1)
    #Sample data
    df<-data.frame(var=sample(c("a","g","t","c"),600,replace=T))
    
    #The number of windows you want, shift by 10 each time
    n_windows <- ((nrow(df) - 100) / 10) + 1
    
    #Create empty DF, this helps increase speed of below loop
    res <- data.frame(window=rep(NA,n_windows),count_cg=rep(NA,n_windows))
    
    #Loop over each i, paste a leaded version of your sequence onto current sequence and count "cg"s
    for (i in 1:n_windows){
          res$window[i] <- paste0((i-1)*10 + 1,"-",(i-1)*10 + 100)
          subs <- df[((i-1)*10 + 1):((i-1)*10 + 100),"var"]
          subs2<- paste0(as.character(subs),as.character(shift(subs,1L,type="lead")[1:length(subs) - 1]))
          res$count_cg[i] <- sum(subs2=="cg")
    }
       head(res)
      window count_cg
    1  1-100       10
    2 11-110       10
    3 21-120        8
    4 31-130        9
    5 41-140        9
    6 51-150        9