Regex. PCRE. Find repeated words anywhere in the text


I need to find words which are repeated 3 or more times anywhere in the text. It's quite easy to find consecutive repeated words like this:

\b(\w+)\s+\1\b

But I can't figure out how to get a backreference for each of these words when they are not adjacent. I need exactly one backreference for each group of repeated words.

How to select more and more of words and words words and even more of those more

Is it possible to backreference both "more" and "words" in this example?

\b(\w+).*\1\b

Solution

  • According to the comments, for your requirement you could use:

    \b(\w{4,})\b(?=.*?\b(\1)\b.*?\b(\1)\b)
    

    About the groups

    The pattern uses one capturing group outside the positive lookahead and 2 capturing groups inside it.

    For every word captured in group 1, the next 2 occurrences of that word are captured in groups 2 and 3, which guarantees that the word occurs at least 3 times.

    Keep in mind that the same word can be matched more than once: every occurrence that still has 2 more occurrences after it produces its own match, so deduplicate when you process the groups and matches afterwards.

    Explanation

    • \b Word boundary
    • ( Capture group 1
      • \w{4,} Match 4 or more word characters
    • ) Close group
    • \b Word boundary
    • (?= Positive lookahead, assert that what is on the right is
      • .*? Match any character except a newline, non-greedy
      • \b(\1)\b Capture group 2, match the same text as group 1
      • .*? Match any character except a newline, non-greedy
      • \b(\1)\b Capture group 3, match the same text as group 1
    • ) Close lookahead
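
A minimal sketch of how the matches could be post-processed, using the sample sentence from the question. It assumes Python's `re` module rather than PCRE; the constructs in this pattern (word boundaries, backreferences, lookahead) are expected to behave the same way in both. The helper name `repeated_words` is made up for illustration.

```python
import re

# Sample sentence from the question.
TEXT = ("How to select more and more of words and words words "
        "and even more of those more")

# Pattern from the answer: a word of 4+ characters that occurs at least
# two more times later on the same line.
PATTERN = re.compile(r"\b(\w{4,})\b(?=.*?\b(\1)\b.*?\b(\1)\b)")


def repeated_words(text):
    """Return each word matched by PATTERN once, in order of first match."""
    unique = {}  # dicts preserve insertion order
    for match in PATTERN.finditer(text):
        unique.setdefault(match.group(1), match.start())
    return list(unique)


# Raw matches: "more" is reported twice because both its first and its
# second occurrence are still followed by two more occurrences.
raw = [m.group(1) for m in PATTERN.finditer(TEXT)]
print(raw)                   # ['more', 'more', 'words']
print(repeated_words(TEXT))  # ['more', 'words']
```

"and" also occurs 3 times in the sentence, but it is skipped because `\w{4,}` requires at least 4 characters.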

    Edit

    To match a word that is followed by exactly 2 more occurrences of the same word, but not 3, you could use a positive lookahead (?= to assert that group 1 occurs 2 more times to the right, and a negative lookahead (?! to assert that group 1 does not occur 3 more times to the right.

    \b(\w{4,})\b(?=(?:.*\b\1\b){2})(?!(?:.*\b\1\b){3})
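
As a rough check of the "exactly 2 more" variant, the sketch below runs it on the sample sentence from the question. It again assumes Python's `re` module instead of PCRE, and it places a word boundary after the first group so that only whole words can match. The name `EXACTLY_TWO_MORE` is chosen only for this example.

```python
import re

# Sample sentence from the question.
TEXT = ("How to select more and more of words and words words "
        "and even more of those more")

# Word of 4+ characters followed by exactly two later occurrences:
# the positive lookahead requires two, the negative one forbids three.
EXACTLY_TWO_MORE = re.compile(
    r"\b(\w{4,})\b(?=(?:.*\b\1\b){2})(?!(?:.*\b\1\b){3})"
)

# Record the start offset together with the word so it is visible
# which occurrence was matched.
hits = [(m.start(), m.group(1)) for m in EXACTLY_TWO_MORE.finditer(TEXT)]
print(hits)  # [(23, 'more'), (31, 'words')]
```

Note that "more" occurs 4 times in the sentence. Its first occurrence (offset 14) is rejected because 3 occurrences follow it, but its second occurrence (offset 23) still matches, since the lookaheads only count occurrences to the right of the current position.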
    
