I have a long DNA sequence in a text file, made up of the characters A, T, C and G. I am looking for a method in R to find the longest stretch of a repeated character. Let's say my string looks like this:
AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA
I need the output, ideally with the count: AAAAAAAAAAAAAAAA n=16
Please help me with this.
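If you have one string:
library(tidyverse)
string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
# str_extract_all() returns a list with one element per input string,
# so take the first element to get the character vector of runs
x <- str_extract_all(string, "(.)\\1+")[[1]]
x[which.max(nchar(x))]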
[1] "AAAAAAAAAAAAAAAA"
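If you would rather avoid the regex, a base R alternative is run-length encoding with rle(). This is only a sketch of a different technique, not the approach above, and the function name longest_run_rle is made up for illustration. Note that, unlike the "(.)\\1+" pattern, it also treats single characters as runs of length 1.

```r
# Hypothetical helper (illustrative name): longest run via run-length encoding
longest_run_rle <- function(s) {
  chars <- strsplit(s, "")[[1]]     # split the string into single characters
  runs  <- rle(chars)               # consecutive identical characters -> values + lengths
  i     <- which.max(runs$lengths)  # index of the first longest run
  strrep(runs$values[i], runs$lengths[i])
}

longest_run_rle("AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA")
```

If two runs tie for the longest, which.max() picks the first one.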
If you have many strings:
str_extract_all(c(string, string), "(.)\\1+") %>%
  map_chr(~ .x[which.max(nchar(.x))])
[1] "AAAAAAAAAAAAAAAA" "AAAAAAAAAAAAAAAA"
To get the counts, just use nchar(), or even str_count(), on the result.
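To get the run and its count together, in the "AAAAAAAAAAAAAAAA n=16" style asked for, something like the following might work. It is a minimal sketch in base R (regmatches() and gregexpr() instead of stringr, so it runs without the tidyverse), using the same "(.)\\1+" pattern. The function name longest_repeat and the second test string "GATTACA" are made up for illustration, and returning NA for strings with no repeated character is my own assumption about what you would want.

```r
# Hypothetical helper (illustrative name): longest repeated-character run
# per input string, together with its length
longest_repeat <- function(strings) {
  # Same pattern as above: any character followed by one or more copies of itself
  runs <- regmatches(strings, gregexpr("(.)\\1+", strings, perl = TRUE))
  best <- vapply(
    runs,
    # Assumption: strings without any repeated character give NA
    function(r) if (length(r) > 0) r[which.max(nchar(r))] else NA_character_,
    character(1)
  )
  data.frame(run = best, n = nchar(best))
}

string <- "AAGTGCGGGTTCAGATCGCCCCCCCATCGGGCAAAAAAAAAAAAAAAATCGA"
res <- longest_repeat(c(string, "GATTACA"))
paste0(res$run, " n=", res$n)
```

The count is just nchar() applied to the extracted run, as described above; the data frame only keeps the two pieces side by side.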