Finding the longest stretch of repeated words in a long string of characters


I have a long DNA sequence in a text file, made up of the characters A, T, C and G. I am looking for a method in R to find the longest stretch of a single repeated character. Let's say my string looks like this: AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA

I need the output, ideally with the count, for example: AAAAAAAAAAAAAAAA n=16

Please help me with this.


Solution

  • if you have one string:

    library(tidyverse)
    string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
    
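    # str_extract_all() returns a list with one element per input string,
    # so take the first element to get a character vector of runs
    x <- str_extract_all(string, "(.)\\1+")[[1]]
    # keep the longest run (the first one if there are ties)
    x[which.max(nchar(x))]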
    
    [1] "AAAAAAAAAAAAAAAA"
    

  • if you have many strings:

    str_extract_all(c(string, string), "(.)\\1+") %>%
      map_chr(~.x[which.max(nchar(.x))])
    
    [1] "AAAAAAAAAAAAAAAA" "AAAAAAAAAAAAAAAA"
    

    To get the counts, call nchar() on the result, or even str_count(), whose default pattern counts characters.
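
As an alternative sketch that returns the run and its count in one step, base R's rle() (run-length encoding) can be used instead of the regular expression. This is a different technique from the stringr approach above, and `longest_run` is a hypothetical helper name, not part of any package. Unlike the pattern "(.)\\1+", which only matches runs of two or more, rle() also counts runs of length one.

```r
# Hypothetical helper (not from the answer above): find the longest run of a
# single repeated character using run-length encoding from base R.
longest_run <- function(s) {
  chars <- strsplit(s, "")[[1]]       # split the string into single characters
  runs  <- rle(chars)                 # lengths and values of consecutive runs
  i     <- which.max(runs$lengths)    # index of the first longest run
  list(
    run = strrep(runs$values[i], runs$lengths[i]),  # rebuild the run as text
    n   = runs$lengths[i]                           # its length
  )
}

# Small example that is easy to check by eye
small <- longest_run("AACCCCGT")
small$run  # "CCCC"
small$n    # 4

# The string from the question
string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
res <- longest_run(string)
res$run
res$n
```

For many strings, the same helper could be applied with lapply(strings, longest_run). On ties, which.max() keeps the first longest run.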