Suppose I have the following sequence:
AAAAAAAAAAAAGCCAGGTGCGGTGGCTCATGCCTGTAAGCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACTAGAGGTCAG
Starting from the marked A (bold in the original post; it is the A immediately before GGTGC), I want to split the sequence into chunks of 5 characters, with consecutive chunks 3 characters apart, meaning I want to get
'GGTGC', 'GGCTC', 'CCTGT', 'CCCAG', and so on until the end. Then I would like to get the same information going from the bold A back to the start of the sequence, meaning:
AAGCC, AAAAA, ...
How can I do this?
We can use a regex lookaround to do the split, i.e. we split at the 3 characters (. matches any character in a regex) that follow 5 characters:
strsplit(str1, "(?<=.....)...", perl = TRUE)[[1]]
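Note that strsplit starts scanning at the beginning of the string, so to begin right after the marked A the split can be applied to the part of the sequence that follows it. A minimal sketch, assuming the marked A is the one immediately before the first expected chunk GGTGC (the names anchor, after and chunks are illustrative and not part of the answer above):

```r
str1 <- "AAAAAAAAAAAAGCCAGGTGCGGTGGCTCATGCCTGTAAGCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACTAGAGGTCAG"

# Assumption: the marked A is the one directly before "GGTGC"
anchor <- as.integer(regexpr("AGGTGC", str1, fixed = TRUE))

# Everything after the marked A
after <- substring(str1, anchor + 1)

# Keep 5 characters, drop the next 3, repeat
chunks <- strsplit(after, "(?<=.....)...", perl = TRUE)[[1]]
head(chunks, 4)
```

If the remaining length does not line up with the 5 + 3 rhythm, the last element can have a length other than 5, so it may need to be checked or dropped.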
Or, if we want to construct the pattern dynamically, use strrep with paste0:
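n1 <- 200  # number of characters to keep in each chunk
n2 <- 50   # number of characters to drop between chunks
# n1 = 5 and n2 = 3 give the literal pattern "(?<=.....)..." used above
pat <- paste0("(?<=", strrep(".", n1), ")", strrep(".", n2))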
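To see what the dynamically built pattern does, it can help to try it with small values on a short string. The sketch below wraps the construction in a helper; make_pat and the input x are made up for illustration and are not part of the original answer:

```r
# Hypothetical helper: build a pattern that keeps `keep` characters
# and splits on the `skip` characters that follow them
make_pat <- function(keep, skip) {
  paste0("(?<=", strrep(".", keep), ")", strrep(".", skip))
}

pat_small <- make_pat(5, 3)
identical(pat_small, "(?<=.....)...")  # same as the literal pattern

x <- "abcdeXXXfghijYYYklmno"  # made-up input: 5 kept, 3 dropped, repeated
res <- strsplit(x, pat_small, perl = TRUE)[[1]]
res
```

The same helper could then be called with larger values such as 200 and 50 for longer sequences.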
The input str1 used above is the sequence from the question:
str1 <- "AAAAAAAAAAAAGCCAGGTGCGGTGGCTCATGCCTGTAAGCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACTAGAGGTCAG"
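The question also asks for chunks going from the marked A back to the start of the sequence, which the regex split does not cover directly. One possible alternative is to compute the chunk positions and extract them with substring. This is only a sketch under assumptions: the marked A is taken to be the one immediately before GGTGC, there is room for at least one full chunk on each side of it, and the names anchor, len, gap, forward and backward are illustrative.

```r
str1 <- "AAAAAAAAAAAAGCCAGGTGCGGTGGCTCATGCCTGTAAGCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACTAGAGGTCAG"

len <- 5  # chunk length
gap <- 3  # characters skipped between chunks

# Assumption: the marked A is the one directly before "GGTGC"
anchor <- as.integer(regexpr("AGGTGC", str1, fixed = TRUE))

# Forward: chunk start positions after the anchor, keeping full chunks only
starts <- seq(anchor + 1, nchar(str1) - len + 1, by = len + gap)
forward <- substring(str1, starts, starts + len - 1)

# Backward: chunk end positions before the anchor, keeping full chunks only
ends <- seq(anchor - 1, len, by = -(len + gap))
backward <- substring(str1, ends - len + 1, ends)

forward
backward
```

Because incomplete chunks at either end are never generated, every element has exactly len characters; seq would raise an error if there were no room for a full chunk on one side, so real code may want a guard for that case.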