Tags: regex, vb.net, duplicates, expression

Regular expression for duplicate character sequences


I want to use a regular expression to remove duplicate character sequences (words) from a string. My question is similar to the one entitled “regular expression for duplicate words”, but I have some additional requirements.

  1. I need to include additional characters. The accepted answer to the linked question only detects words consisting of alphanumeric characters, but I need to include symbol characters such as “@” in my definition of a word.

  2. I need to match multiple repetitions of a pattern. If a word is repeated three times, the accepted answer to the linked question only removes one of the duplicates, but I need to remove both of them.

Here is the sample string I am using for testing:

hello me now @@@ @@@ @@@ then method me @@@

My desired result is:

hello me now @@@ then method me @@@


Solution

  • The keys to solving this are:

    1. Use a lookbehind.
    2. Look for white-space (\s) and non-white-space (\S).

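    Here is the regex you need: /(?<=(\S+)\s+)\1\s+/g

    The slashes and the trailing g (global) flag are regex-literal notation. In VB.NET, pass only the pattern (?<=(\S+)\s+)\1\s+ to Regex.Replace, which replaces every match by default.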
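As a rough illustration of the pattern in use, here is a minimal TypeScript sketch. The question itself concerns VB.NET; TypeScript is used here only because the pattern is written in /.../g notation, and the sketch assumes a runtime that supports variable-length lookbehind (ES2018 or later). The helper name removeRepeatedWords is hypothetical and not part of the original answer.

```typescript
// Pattern from the answer, built with the RegExp constructor, so the
// backslashes are doubled inside the string literal.
//   (?<=(\S+)\s+)  lookbehind: capture the previous word and the white space after it
//   \1\s+          the same word again, plus its trailing white space
const duplicateWord = new RegExp("(?<=(\\S+)\\s+)\\1\\s+", "g");

// Hypothetical helper name, not taken from the original answer.
function removeRepeatedWords(text: string): string {
  // Replace every repeated word (and the white space after it) with nothing.
  return text.replace(duplicateWord, "");
}

const sample = "hello me now @@@ @@@ @@@ then method me @@@";
console.log(removeRepeatedWords(sample));
// prints: hello me now @@@ then method me @@@
```

Both extra copies of “@@@” are removed in a single pass, because the lookbehind can still see the word that precedes each repeat.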


    Now I will explain the process of creating this regular expression. First, let’s state the goal. The goal is to match any word which is the same as the previous word, so that we can strip it, that is, replace it with nothing. So let’s step through the process:

    1. The first step is to match every word in your string. Normally you would use \w+, but that only matches letters, digits, and the underscore. Instead, use \S+, which matches every character that is not white space. Note that it matches “@@@” as well as the ordinary words.

    2. The second step is to match a word only if it is preceded by another word. For this we use a lookbehind expression (?<= ... ) that looks for a word \S+ followed by white space \s+, which gives (?<=\S+\s+)\S+. The very first word in the string is no longer matched. Perfect.

    3. The third step is to match a word only if it is the same as the word preceding it. For that we need to capture the preceding word (by placing parentheses around \S+ inside the lookbehind expression), then refer to that captured group in our match (replacing our original \S+ with \1). The expression is now (?<=(\S+)\s+)\1.

    4. After stripping the matches from the third step (replacing them with nothing), we are still left with some extra spaces where the removed words used to be. We can avoid these by including any white space following the word in our match expression, so we just add \s+ to the end of it. That brings us to the final result, which I gave at the beginning of this answer.

    /(?<=(\S+)\s+)\1\s+/g
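The four steps above can be replayed against the sample string. The following is a minimal TypeScript sketch, assuming a runtime with variable-length lookbehind support (ES2018 or later); the helper matchesOf and the variable names are hypothetical and only serve the illustration.

```typescript
const testString = "hello me now @@@ @@@ @@@ then method me @@@";

// Hypothetical helper: list every match of a pattern in the test string.
// Patterns are plain strings, so backslashes are doubled.
function matchesOf(pattern: string): string[] {
  return testString.match(new RegExp(pattern, "g")) ?? [];
}

// Step 1: \w+ skips the "@@@" words, while \S+ finds all ten words.
const wordChars = matchesOf("\\w+");
const nonSpace = matchesOf("\\S+");
console.log(wordChars.length, nonSpace.length); // prints: 6 10

// Step 2: require a preceding word, so the first word drops out.
const preceded = matchesOf("(?<=\\S+\\s+)\\S+");
console.log(preceded.length, preceded[0]); // prints: 9 me

// Step 3: match only a word equal to the captured preceding word.
const repeated = matchesOf("(?<=(\\S+)\\s+)\\1");
console.log(repeated); // prints: [ '@@@', '@@@' ]

// Removing the step 3 matches leaves three spaces in a row.
const afterStep3 = testString.replace(new RegExp("(?<=(\\S+)\\s+)\\1", "g"), "");
console.log(JSON.stringify(afterStep3));
// prints: "hello me now @@@   then method me @@@"

// Step 4: also consume the white space after each repeated word.
const finalPattern = "(?<=(\\S+)\\s+)\\1\\s+";
const afterStep4 = testString.replace(new RegExp(finalPattern, "g"), "");
console.log(JSON.stringify(afterStep4));
// prints: "hello me now @@@ then method me @@@"
```

One consequence of the \s+ added in the fourth step: a repeated word at the very end of a string has no white space after it, so it is left in place. If that case matters for your input, it needs separate handling.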