I need to find words which are repeated 3 or more times anywhere in the text. It's quite easy to find consecutive repeated words like this:
\b(\w+)\s+\1\b
But I really can't work out how to set one backreference for each of these words. I should only set one backreference for each group of repeated words.
How to select more and more of words and words words and even more of those more
Is it possible to backreference "more" and "words" in this example?
\b(\w+).*\1\b
According to the comments, for your requirement you could use:
\b(\w{4,})\b(?=.*?\b(\1)\b.*?\b(\1)\b)
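As a quick check, the pattern can be run against the sample sentence from the question. This is only a minimal sketch that assumes Python's built-in re module; the variable names are illustrative and not part of the original answer.

```python
import re

# Sample sentence taken from the question.
text = "How to select more and more of words and words words and even more of those more"

# A word of 4+ characters that occurs at least two more times later in the text.
pattern = re.compile(r"\b(\w{4,})\b(?=.*?\b(\1)\b.*?\b(\1)\b)")

# Group 1 holds the matched word; groups 2 and 3 hold the later occurrences
# found inside the lookahead.
repeated = [m.group(1) for m in pattern.finditer(text)]
print(repeated)
```

Note that . does not match a newline by default, so for text that spans several lines you would probably want to pass re.DOTALL as well.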
About the groups
The pattern makes use of a capturing group outside and 2 capturing groups inside the positive lookahead.
For every word captured in group 1, the next 2 occurrences of that word are captured in groups 2 and 3, so the word occurs at least 3 times.
Keep in mind that the matches overlap: a word that occurs more than 3 times is matched more than once, which matters if you process the matches and groups afterwards.
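The overlap can be made visible by printing the start offset of each match, and one possible way to deal with it is to deduplicate the captured words afterwards. The sketch below assumes Python's re module; hits and unique_words are hypothetical names chosen for illustration.

```python
import re

text = "How to select more and more of words and words words and even more of those more"
pattern = re.compile(r"\b(\w{4,})\b(?=.*?\b(\1)\b.*?\b(\1)\b)")

# Every occurrence that still has two later occurrences produces a match,
# so "more" (4 occurrences) is reported twice and "words" (3 occurrences) once.
hits = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
print(hits)

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_words = list(dict.fromkeys(word for _, word in hits))
print(unique_words)
```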
Explanation
\b          Word boundary
(           Capture group 1
\w{4,}      Match a word character 4 or more times
)           Close group 1
\b          Word boundary
(?=         Positive lookahead, assert that what is on the right is
.*?         Match any char except a newline, non-greedy
\b(\1)\b    Capture group 2, match the same text as group 1
.*?         Match any char except a newline, non-greedy
\b(\1)\b    Capture group 3, match the same text as group 1
)           Close the lookahead
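The same breakdown can be kept next to the pattern itself by writing it in verbose mode. This is a sketch assuming Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments; the name VERBOSE_PATTERN is only illustrative.

```python
import re

VERBOSE_PATTERN = re.compile(
    r"""
    \b              # word boundary
    (               # capture group 1
        \w{4,}      #   4 or more word characters
    )               # close group 1
    \b              # word boundary
    (?=             # positive lookahead: what follows must contain
        .*?         #   any chars except a newline, non-greedy
        \b(\1)\b    #   group 2: same text as group 1, as a whole word
        .*?         #   any chars except a newline, non-greedy
        \b(\1)\b    #   group 3: same text as group 1, as a whole word
    )               # close the lookahead
    """,
    re.VERBOSE,
)

text = "How to select more and more of words and words words and even more of those more"
verbose_words = [m.group(1) for m in VERBOSE_PATTERN.finditer(text)]
print(verbose_words)
```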
Edit
To match a word that is followed by 2, but not 3, further occurrences of the same word, you could use a positive lookahead (?= to assert that group 1 is followed by 2 occurrences of group 1, and a negative lookahead (?! to assert that group 1 is not followed by 3 occurrences of group 1.
\b(\w{4,})(?=(?:.*\b\1\b){2})(?!(?:.*\b\1\b){3})
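Applied to the sample sentence from the question, this variant reports each repeated word only at the occurrence that has exactly two later occurrences. The snippet is a minimal sketch assuming Python's re module, with illustrative names; other regex engines may differ in detail.

```python
import re

text = "How to select more and more of words and words words and even more of those more"

# Positive lookahead: at least 2 later occurrences.
# Negative lookahead: but not 3 later occurrences.
once_per_word = re.compile(r"\b(\w{4,})(?=(?:.*\b\1\b){2})(?!(?:.*\b\1\b){3})")

# With a single capturing group, findall returns the group 1 strings directly.
words = once_per_word.findall(text)
print(words)
```

For this input the first "more" is skipped because three more occurrences follow it, so only the second "more" and the first "words" match.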