r, string, unique

How to extract the unique letter from each run of consecutive letters in a string?


For example, I have a character string x = "AAATTTGGAA".

I want to split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".

The unique letter of each chunk is then "A", "T", "G", "A", so the expected output is "ATGA".

How can I do this?


Solution

  • Here is a regex-based approach that splits the string into its runs:

    x <- "AAATTTGGAA"
    out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
    out
    
    [1] "AAA" "TTT" "GG"  "AA"
    

    The regex pattern splits at every zero-width position where the preceding and following characters are different.

    (?<=(.))  lookbehind that also captures the preceding character as group 1
    (?!\\1)   negative lookahead asserting that the next character is not that captured character (\\1 is how the backreference \1 is written inside an R string)
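
The split above yields the chunks but not yet the expected "ATGA". One way to finish, sketched here with base R only, is to take the first letter of each chunk and paste the letters together. The variable names (chunks, from_chunks, from_gsub, from_rle) are illustrative and not part of the original answer. The two alternatives that skip the split are shown only for comparison.

```r
x <- "AAATTTGGAA"

# Split into runs of identical letters (same pattern as in the answer).
chunks <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# Keep the first letter of each run and glue the letters back together.
from_chunks <- paste(substr(chunks, 1, 1), collapse = "")

# Alternative without splitting: collapse each repeated run to one letter.
from_gsub <- gsub("(.)\\1+", "\\1", x, perl = TRUE)

# Alternative using run-length encoding on the individual characters.
from_rle <- paste(rle(strsplit(x, "")[[1]])$values, collapse = "")

from_chunks
# [1] "ATGA"
```

All three variants give the same result for this input. The gsub() version is probably the shortest if the intermediate chunks are not needed.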