javascript regex mediawiki-templates

How to remove all Wiki templates from a string?


I have the content of a Wikipedia article that contains templates like this:

{{Use mdy dates|date=June 2014}}
{{Infobox person
| name        = Richard Matthew Stallman
| image       = Richard Stallman - Fête de l'Humanité 2014 - 010.jpg
| caption     = Richard Stallman, 2014
| birth_date  = {{Birth date and age|1953|03|16}}
| birth_place = New York City
| nationality = American
| other_names = RMS, rms
| known_for   = Free software movement, GNU, Emacs, GNU Compiler Collection|GCC
| alma_mater  = Harvard University,<br />Massachusetts Institute of Technology
| occupation  = President of the Free Software Foundation
| website     = {{URL|https://www.stallman.org/}}
| awards      =  MacArthur Fellowship<br />EFF Pioneer Award<br />''... see #Honors and awards|Honors and awards''
}}

or

{{Citation needed|date=May 2011}}

How can I remove them? I could use the regex /\{\{[^}]+\}\}/g, but it does not work for nested templates such as the Infobox.

I tried the code below, hoping to remove the nested templates first and then the Infobox, but I got the wrong result.

var input = document.getElementById('input');
input.innerHTML = input.innerHTML.replace(/\{\{[^}]+\}\}/g, '');
<pre id="input">    {{Use mdy dates|date=June 2014}}
    {{Infobox person
    | name        = Richard Matthew Stallman
    | image       =Richard Stallman - Fête de l'Humanité 2014 - 010.jpg
    | caption     = Richard Stallman, 2014
    | birth_date  = {{Birth date and age|1953|03|16}}
    | birth_place = New York City
    | nationality = American
    | other_names = RMS, rms
    | known_for   = Free software movement, GNU, Emacs, GNU Compiler Collection|GCC
    | alma_mater  = Harvard University,<br />Massachusetts Institute of Technology
    | occupation  = President of the Free Software Foundation
    | website     = {{URL|https://www.stallman.org/}}
    | awards      =  MacArthur Fellowship<br />EFF Pioneer Award<br />''... see #Honors and awards|Honors and awards''
    }}</pre>


Solution

  • JavaScript regexes have no features (such as recursion or balancing groups) for matching nested brackets. One regex-based approach is to process the string several times with a pattern that matches only the innermost brackets, until there is nothing left to replace:

    do {
        var cnt=0;
        txt = txt.replace(/{{[^{}]*(?:{(?!{)[^{}]*|}(?!})[^{}]*)*}}/g, function (_) {
            cnt++; return '';
        });
    } while (cnt);
    

    pattern details:

    {{
    [^{}]* # all that is not a bracket
    (?: # this group is only useful if you need to allow single brackets
        {(?!{)[^{}]* # an opening bracket not followed by another opening bracket
      |   # OR
        }(?!})[^{}]* # same thing for closing brackets
    )*
    }}
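
To see why several passes are needed, here is a minimal sketch that wraps the loop above in a function and counts the passes. The function name `stripTemplatesByPasses`, the returned object shape, and the sample string are assumptions made for illustration; only the pattern comes from the answer.

```javascript
// Hypothetical wrapper around the loop above; also reports how many passes removed something.
function stripTemplatesByPasses(txt) {
    // Matches a {{...}} pair that contains no other {{ or }} (an innermost template).
    var innermost = /{{[^{}]*(?:{(?!{)[^{}]*|}(?!})[^{}]*)*}}/g;
    var passes = 0;
    var cnt;
    do {
        cnt = 0;
        txt = txt.replace(innermost, function () { cnt++; return ''; });
        if (cnt) passes++;
    } while (cnt);
    return { text: txt, passes: passes };
}

// Made-up sample with one level of nesting.
var sample = 'A {{Infobox | born = {{Birth date|1953|03|16}} | site = {{URL|x}} }} B {{Citation needed}} C';
var out = stripTemplatesByPasses(sample);
console.log(out.passes);               // 2
console.log(JSON.stringify(out.text)); // "A  B  C"
```

The first pass removes the three templates that contain no other template; only then does the Infobox become an innermost match, so the number of productive passes grows with the nesting depth.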
    

    If you don't want to process the string several times, you can also read the string character by character, incrementing and decrementing a counter when brackets are found.
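
The single-pass idea could be sketched roughly as follows. The function name `stripTemplatesByScan` is hypothetical, and the handling of an unmatched `}}` (dropping it silently) is an assumption, not something the answer specifies.

```javascript
// Hypothetical helper: one left-to-right scan with a nesting counter.
function stripTemplatesByScan(txt) {
    var depth = 0;   // number of {{ currently open
    var out = '';
    var i = 0;
    while (i < txt.length) {
        var pair = txt.slice(i, i + 2);
        if (pair === '{{') {
            depth++;
            i += 2;
        } else if (pair === '}}') {
            if (depth > 0) depth--;   // assumption: a stray }} is dropped
            i += 2;
        } else {
            if (depth === 0) out += txt.charAt(i);   // keep text only outside templates
            i++;
        }
    }
    return out;
}

console.log(JSON.stringify(stripTemplatesByScan('a {{x {{y}} z}} b'))); // "a  b"
```

Single braces outside a template are kept as ordinary text. Triple-brace template parameters such as `{{{1}}}` are not treated specially in this sketch.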

    Another way, using split and Array.prototype.reduce:

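    var stk = 0; // current nesting depth of {{ ... }}
    var result = txt.split(/({{|}})/).reduce(function(c, v) {
        if (v == '{{') { stk++; return c; }                  // entering a template
        if (v == '}}') { stk = stk ? stk-1 : 0; return c; }  // leaving one; an unmatched }} is ignored
        return stk ? c : c + v;                              // keep text only outside templates
    }, '');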
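
As a usage example, the split/reduce approach can be applied to a shortened version of the Infobox from the question. This is only a sketch: the function name `removeTemplates`, the trailing sentence "Some article text." and the final `trim()` are assumptions added for illustration.

```javascript
// Hypothetical wrapper that keeps the depth counter local to each call.
function removeTemplates(txt) {
    var depth = 0;
    return txt.split(/({{|}})/).reduce(function (kept, piece) {
        if (piece === '{{') { depth++; return kept; }
        if (piece === '}}') { depth = depth ? depth - 1 : 0; return kept; }
        return depth ? kept : kept + piece;
    }, '');
}

// Shortened input from the question, plus a made-up sentence of article text.
var article = [
    '{{Use mdy dates|date=June 2014}}',
    '{{Infobox person',
    '| name        = Richard Matthew Stallman',
    '| birth_date  = {{Birth date and age|1953|03|16}}',
    '| website     = {{URL|https://www.stallman.org/}}',
    '}}',
    'Some article text.{{Citation needed|date=May 2011}}'
].join('\n');

// trim() removes the newlines left behind between the removed templates.
var cleaned = removeTemplates(article).trim();
console.log(cleaned); // Some article text.
```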