I am trying to find, in R, all strings whose first few (>= 2) characters occur a second time later in the same string.
For example, these strings should be selected:
(1) allochirally ------> the first 3 characters 'all' occur twice in the string
(2) froufrou ------> the first 4 characters 'frou' occur twice in the string
(3) undergrounder ------> the first 5 characters 'under' occur twice in the string
These strings should NOT be selected:
(1) gummage ------> the first character 'g' occurs twice, but that is only 1 character, and the condition requires at least 2 leading characters
(2) hypergoddess ------> no leading characters are repeated
(3) kgashga ------> 'ga' occurs twice, but it does not include the first character 'k', and the condition requires the repeated part to start at the first character
I have heard that backreferences, or constructs such as \b or \w, might be helpful, but I have not been able to work out the pattern. Could you help?
Note: I have seen the following approach, which uses str_extract_all from library(stringr) and the test xmatch <- str_extract_all(x, regex) == x:
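library(stringr)
x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
regex <- "as described details here"  # placeholder for the pattern I am looking for
function(x, regex) {
  # keep only the strings for which the extracted match equals the whole string
  xmatch <- str_extract_all(x, regex) == x
  x[xmatch]
}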
A concise solution would be preferred. Thanks!
Use grepl:
x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
grepl("^(.{2,}).*\\1.*$", x)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
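The pattern anchors at the start of the string with ^, captures the first two or more characters with (.{2,}), and then uses the backreference \\1 to require that the same captured text occurs again later in the string. The trailing .*$ consumes the rest of the string and does not change which strings grepl accepts.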
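To see why each piece of the pattern matters, here is a small sketch that loosens one part at a time. The variable names (full, no_anchor, one_char) are made up for illustration, and perl = TRUE is an assumption added here only to pin down the regex engine; the answer's own call uses the default engine.

```r
x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")

# Full pattern: a prefix of 2+ characters, anchored at the start, that recurs later.
full <- grepl("^(.{2,}).*\\1", x, perl = TRUE)

# Without the ^ anchor, a repeat anywhere qualifies,
# so "kgashga" ('ga' ... 'ga') is wrongly accepted.
no_anchor <- grepl("(.{2,}).*\\1", x, perl = TRUE)

# With .+ instead of .{2,}, a single repeated first character is enough,
# so "gummage" ('g' ... 'g') is wrongly accepted.
one_char <- grepl("^(.+).*\\1", x, perl = TRUE)

x[full]
x[no_anchor]
x[one_char]
```

Only the full pattern rejects all three of the strings that should not be selected.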
If you want to use the logic in my answer to obtain a vector of matching strings, then just use:
x[grepl("^(.{2,}).*\\1.*$", x)]
[1] "allochirally"  "froufrou"      "undergrounder"
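If you also want to see which prefix was repeated, one possible sketch is to reuse the same pattern with sub and keep only the captured group. The names pattern, hits and prefix are illustrative, and perl = TRUE is an assumption made here so that the greedy quantifier reliably captures the longest repeated prefix.

```r
x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
pattern <- "^(.{2,}).*\\1.*$"

# Strings whose leading 2+ characters recur later in the string.
hits <- x[grepl(pattern, x, perl = TRUE)]

# The pattern spans the whole string, so replacing the match with
# capture group 1 leaves only the repeated prefix.
prefix <- sub(pattern, "\\1", hits, perl = TRUE)

# Named vector: names are the matching strings, values are their prefixes.
setNames(prefix, hits)
```

Here the trailing .*$ does matter: without it, sub would replace only part of the string and leave the remainder attached to the prefix.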