Splitting data into chunks but with distance in between


Suppose I have the following sequence:

AAAAAAAAAAAAGCCAGGTGCGGTGGCTCATGCCTGTAAGCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACTAGAGGTCAG

Starting from the A that immediately precedes GGTGC (it was bold in the original post), I want to split the sequence into chunks of 5 characters, with 3 characters skipped between consecutive chunks, meaning I want to get

'GGTGC', 'GGCTC', 'CCTGT', 'CCCAG', and so on until the end. Then I would like to get the same kind of chunks going from that A back to the start of the sequence, meaning:

AAGCC, AAAAA, ...

How can I do this?


Solution

  • We can use a regex lookbehind to do the split, i.e. we split on 3 characters (. matches any character in a regex) that follow 5 characters, so the 3-character gaps are consumed as separators and the 5-character chunks are returned

    strsplit(str1, "(?<=.....)...", perl = TRUE)[[1]]
    

    Or, if we want to construct the pattern dynamically, use strrep with paste0, where n1 is the chunk length and n2 is the gap length

    n1 <- 200
    n2 <- 50
    pat <- paste0("(?<=", strrep(".", n1), ")", strrep(".", n2))
    

    data

    str1 <- "AAAAAAAAAAAAGCCAGGTGCGGTGGCTCATGCCTGTAAGCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACTAGAGGTCAG"
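
The strsplit pattern above chunks from the start of the string, so it does not by itself start at a chosen position or walk backwards. A minimal sketch of one way to do that with plain position arithmetic and substr is shown here. Note that chunks_from_anchor() is a hypothetical helper written for illustration, and the anchor (the A inside the first "GCCAGG") is an assumption inferred from the expected output in the question.

```r
# Illustrative sketch; chunks_from_anchor() is a hypothetical helper,
# not part of the original answer or of base R.
chunks_from_anchor <- function(x, anchor, width = 5L, gap = 3L) {
  n    <- nchar(x)
  step <- width + gap

  # Number of complete chunks that fit after / before the anchor.
  k_fwd <- max(0L, (n - anchor - width) %/% step + 1L)
  k_bwd <- max(0L, (anchor - 1L - width) %/% step + 1L)

  # Start positions, ordered by increasing distance from the anchor.
  fwd_start <- anchor + 1L + (seq_len(k_fwd) - 1L) * step
  bwd_start <- anchor - width - (seq_len(k_bwd) - 1L) * step

  take <- function(s) substr(x, s, s + width - 1L)
  list(
    forward  = vapply(fwd_start, take, character(1)),
    backward = vapply(bwd_start, take, character(1))
  )
}

str1 <- "AAAAAAAAAAAAGCCAGGTGCGGTGGCTCATGCCTGTAAGCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACTAGAGGTCAG"

# Assumption: the anchor is the A inside the first "GCCAGG".
anchor <- as.integer(regexpr("GCCAGG", str1, fixed = TRUE)) + 3L

res <- chunks_from_anchor(str1, anchor)
head(res$forward, 4)   # "GGTGC" "GGCTC" "CCTGT" "CCCAG"
res$backward           # chunks nearest the anchor come first
```

Only complete chunks are returned; a trailing or leading piece shorter than width is dropped, which may or may not be what you want for your data.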