Tags: regex, regex-lookarounds, wikipedia

Regex capturing too much


I have a problem with a .NET regex that I need to create for my AutoWikiBrowser bot on Wikipedia.

The example is rather long, but I need an even longer regex to find up to 14 language indication templates (2-3 letters inside double curly brackets, e.g. {{ab}}) and merge them into a single template (e.g. {{ab}} {{cd}} {{ef}} {{gh}} => {{mul|ab|cd|ef|gh}}).

Here is my regex:

Find: \{\{ *(ab|cd|ef|gh) *\}\} *\{\{ *(ab|cd|ef|gh) *\}\} *(\{\{ *(ab|cd|ef|gh) *\}\})* *(\{\{ *(ab|cd|ef|gh) *\}\})* *(\{\{ *(ab|cd|ef|gh) *\}\})* *(\{\{ *(ab|cd|ef|gh) *\}\})*

Replace: {{mul|$1|$2|$4|$6|$8|$10}}

It works as intended, except when the templates are not separated by a space: in that case the last templates aren't captured properly. You can see the problem with the first line of the test string here: https://regex101.com/r/nMUg0J/2

I think I should use a lookaround, but I can't even find where the problem is.
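
The symptom can be reproduced outside regex101 with a short script. This is only a sketch: it uses Python's re module as a stand-in for the .NET engine, and the helper name apply_original is made up for illustration. The Find pattern is the one above, rebuilt from its repeated part so the group numbers stay the same. It suggests where the problem lies: when no space separates the templates, the first optional group `(...)*` repeats over all the remaining templates, and a repeated capture group only keeps its last iteration.

```python
import re

# One optional trailing template, exactly as repeated four times in the question.
OPTIONAL = r" *(\{\{ *(ab|cd|ef|gh) *\}\})*"

# The Find pattern from the question: two mandatory templates, then four optional ones.
FIND = re.compile(
    r"\{\{ *(ab|cd|ef|gh) *\}\} *\{\{ *(ab|cd|ef|gh) *\}\}" + OPTIONAL * 4
)

# Python spelling of the Replace string {{mul|$1|$2|$4|$6|$8|$10}}.
REPLACE = r"{{mul|\g<1>|\g<2>|\g<4>|\g<6>|\g<8>|\g<10>}}"


def apply_original(text: str) -> str:
    """Apply the question's find/replace pair (hypothetical helper)."""
    return FIND.sub(REPLACE, text)


# With spaces, each optional group matches exactly one template.
print(apply_original("{{ab}} {{cd}} {{ef}} {{gh}}"))

# Without spaces, group 3 repeats over {{ef}}{{gh}} and group 4 keeps only "gh".
print(apply_original("{{ab}}{{cd}}{{ef}}{{gh}}"))
```

In the second call, "ef" is consumed by the match but never reaches the replacement, which matches the behaviour seen on the first line of the test string.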

Note that this regex will create templates with useless pipes if there aren't enough templates to merge, but I'll run this other regex after the first one to remove them: https://regex101.com/r/MuIiWS/1


Solution

  • This is almost certainly easier to achieve with a replacement function and a single regex, but if you are restricted to regex only, a possibly easier solution is to first replace the }}{{ between templates with a | and then add mul| at the beginning of any multiple-language template. So first, replace:

    (?<=ab|cd|ef|gh) *}} *{{ *(?=ab|cd|ef|gh)
    

    with | (demo on regex101), and then replace

    (?<={{)(?=(?:ab|cd|ef|gh)\|)
    

    with mul| (demo on regex101)
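
Both approaches mentioned above can be sketched in a few lines. This is a minimal illustration, not a drop-in AutoWikiBrowser rule: it uses Python's re module in place of the .NET engine, the function names merge_two_step and merge_with_callback are hypothetical, and the language list is limited to the four sample codes from the question. Whether a replacement function is available depends on the tool; in plain .NET the equivalent would be Regex.Replace with a MatchEvaluator.

```python
import re

# Step 1 pattern from the answer: the "}}{{" boundary between two language templates.
BOUNDARY = re.compile(r"(?<=ab|cd|ef|gh) *\}\} *\{\{ *(?=ab|cd|ef|gh)")

# Step 2 pattern from the answer: the empty position right after "{{"
# in a template that already contains a pipe after its first code.
HEAD = re.compile(r"(?<=\{\{)(?=(?:ab|cd|ef|gh)\|)")


def merge_two_step(text: str) -> str:
    """Regex-only variant: join adjacent templates, then insert 'mul|'."""
    joined = BOUNDARY.sub("|", text)
    return HEAD.sub("mul|", joined)


# Alternative: one regex for a run of two or more templates, plus a callback.
RUN = re.compile(
    r"\{\{ *(?:ab|cd|ef|gh) *\}\}(?: *\{\{ *(?:ab|cd|ef|gh) *\}\})+"
)
CODE = re.compile(r"\{\{ *(ab|cd|ef|gh) *\}\}")


def merge_with_callback(text: str) -> str:
    """Replacement-function variant: rebuild each run from its codes."""
    def build(match: re.Match) -> str:
        codes = CODE.findall(match.group(0))  # every code in the run, in order
        return "{{mul|" + "|".join(codes) + "}}"
    return RUN.sub(build, text)


print(merge_two_step("{{ab}}{{cd}}{{ef}}{{gh}}"))
print(merge_with_callback("{{ab}} {{cd}} text {{ef}}"))
```

Neither variant relies on numbered groups for each template, so the number of templates is not capped and no empty pipes are produced that would need a cleanup pass.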