r, regex, string, regular-language

Regular expression to match first few characters repeated twice in string


In R, I am trying to find all strings whose first few (>= 2) characters occur a second time later in the same string.
For example:

These strings should be selected:
(1) allochirally -> the first 3 characters 'all' occur twice in the string
(2) froufrou -> the first 4 characters 'frou' occur twice in the string
(3) undergrounder -> the first 5 characters 'under' occur twice in the string

These strings should NOT be selected:
(1) gummage -> the first character 'g' occurs twice, but that is only 1 character, and the condition requires at least 2 leading characters
(2) hypergoddess -> no leading characters are repeated
(3) kgashga -> 'ga' occurs twice, but it does not include the first character 'k', and the repeated part must start at the first character

I have heard that backreferences (e.g. \b or \w) might be helpful, but I still cannot work it out. Could you help?

Note: I have seen xmatch <- str_extract_all(x, regex) == x suggested as the way to do this, with str_extract_all coming from library(stringr):

x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
regex <- "as described details here"
function(x, regex) {
  xmatch <- str_extract_all(x, regex) == x
  matched_x <- x[xmatch]
}

A very concise solution would be preferred. Thanks!


Solution

  • Use grepl:

    x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
    grepl("^(.{2,}).*\\1.*$", x)
    
    [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
    

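    The pattern anchors at the start of the string (^), captures the first two or more characters in group 1 with (.{2,}), and then requires the same captured text to occur again later in the string via the backreference \\1. Because grepl only tests whether a match exists, the trailing .*$ is optional.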

    To obtain a vector of the matching strings themselves, index x with the logical vector:

    x[grepl("^(.{2,}).*\\1.*$", x)]
    
    [1] "allochirally"  "froufrou"      "undergrounder"
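
As a complement to the grepl call above, the following minimal sketch wraps the same idea in a function and also reports which prefix was found to repeat. The function name `repeated_prefix` and its `min_len` argument are hypothetical and not part of the original answer. The sketch assumes `perl = TRUE`, so that the greedy group backtracks from the longest candidate prefix down to the first one that recurs.

```r
# Sketch only: `repeated_prefix()` and `min_len` are hypothetical names.
repeated_prefix <- function(x, min_len = 2) {
  # Same pattern as in the answer, without the optional trailing ".*$":
  # capture at least `min_len` leading characters, then require them again.
  pattern <- sprintf("^(.{%d,}).*\\1", min_len)
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # In each hit, element 1 is the full match and element 2 is capture group 1.
  # Strings that do not match yield character(0), reported here as NA.
  vapply(hits,
         function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

x <- c("allochirally", "froufrou", "undergrounder", "gummage", "hypergoddess", "kgashga")
prefix <- repeated_prefix(x)   # "all" "frou" "under" NA NA NA
matched <- x[!is.na(prefix)]   # "allochirally" "froufrou" "undergrounder"
```

If only the matching strings are needed, the one-line `x[grepl(...)]` form stays the most concise. The helper is only worth having when the repeated prefix itself is of interest.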