Tags: regex, regex-lookarounds, wikipedia

Regex for removing duplicate values in a specific Wikipedia template


I am trying to remove duplicate values from one specific Wikipedia template (and only that one) with a regex in AutoWikiBrowser, which uses the .NET regex flavour.

I want to find {{mul|fr|en|fr}} and replace it with {{mul|fr|en}}

\b(\w+)\s*\|\s*(?=.*\1) works, but it may also affect other templates that should not be modified.

I tried \{\{mul\|\b(\w+)\s*\|\s*(?=.*\1), but it doesn't work properly.

Note: a Wikipedia template is enclosed in double curly brackets, with its name followed by parameters and values separated by pipes. Here the parameters are unnamed, so only the values appear, and the template is named "mul", which gives {{mul|<foo>|<bar>|<baz>|<...>}}


Solution

  • You can use

    (?<={{mul\|(?:(?!{{|}}).)*?)\b(\w+)\|(?=(?:(?!{{|}}).)*\b\1\b)
    

    Details:

    • (?<={{mul\|(?:(?!{{|}}).)*?) - a variable-length lookbehind (supported in the .NET regex flavor) that matches a location preceded by {{mul| and then zero or more chars (as few as possible, and other than a newline by default), none of which starts a {{ or }} char sequence. This keeps the match inside a {{mul|...}} call that has not been closed yet
    • \b(\w+)\| - a word boundary, one or more word characters (captured into Group 1) and then a | char
    • (?=(?:(?!{{|}}).)*\b\1\b) - a positive lookahead that requires, to the right of the current location, zero or more chars (as many as possible), none of which starts a {{ or }} char sequence, followed by the Group 1 value as a whole word. In other words, the same value must occur again before the template closes.
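
To try the pattern outside AutoWikiBrowser, here is a minimal sketch in TypeScript rather than .NET. It assumes a JavaScript engine with lookbehind support (ES2018 or later), which, like .NET, accepts an unbounded lookbehind. The braces are escaped, and the names `insideTemplate`, `duplicateValue` and `dedupeMul` are illustrative only, not part of AutoWikiBrowser.

```typescript
// One character that does not begin "{{" or "}}" (a tempered dot).
// Backslashes are doubled because the pattern is assembled from strings.
const insideTemplate = "(?:(?!\\{\\{|\\}\\}).)";

// Same structure as the pattern in the answer, with the braces escaped.
const duplicateValue = new RegExp(
  "(?<=\\{\\{mul\\|" + insideTemplate + "*?)" + // we are inside an unclosed {{mul|...
    "\\b(\\w+)\\|" +                            // a value and its trailing pipe
    "(?=" + insideTemplate + "*\\b\\1\\b)",     // the same value recurs before "}}"
  "g"
);

// Hypothetical helper: deletes each value that appears again later
// in the same {{mul|...}} call, leaving other templates untouched.
function dedupeMul(wikitext: string): string {
  return wikitext.replace(duplicateValue, "");
}

console.log(dedupeMul("{{mul|fr|en|fr}}"));  // {{mul|en|fr}}
console.log(dedupeMul("{{lang|fr|en|fr}}")); // {{lang|fr|en|fr}}
```

Note that the pattern deletes the earlier occurrence and keeps the last one, so {{mul|fr|en|fr}} becomes {{mul|en|fr}} rather than {{mul|fr|en}}: the duplicate is gone, but the surviving values may be in a different order.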