
Regexp assistance needed parsing MediaWiki template with JavaScript


I'm handling MediaWiki markup with JavaScript and trying to remove certain template parameters. I'm having trouble matching exactly the text, and only the text, that I want to remove.

Simplified down, the template text can look something like this:

{{TemplateX
| a =
Foo bar
Blah blah

Fizbin foo[[domain:blah]]

Ipsum lorem[[domain:blah]]
|b =1
|c = 0fillertext
|d = 1alphabet
| e =
| f = 10: One Hobbit
| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000
| j = Level 4 [[domain:filk|Songs]]
| k =7 fizbin, 8 [[domain:trekkies|Shatners]]
|l = 
|m = 
}}

The best I've come up with so far is

/\|\s?(a|b|d|f|j|k|m)([^][^\n\|])+/gm

Updated version:

/\|\s?(a|b|d|f|j|k|m)(?:[^\n\|]|[.\n])+/gm

which gives (with the updated regexp):

{{TemplateX


|c = 0fillertext

| e =

| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000

|Songs]]

|Shatners]]
|l = 

But what I'm trying to get is:

{{TemplateX
|c = 0fillertext
| e =
| g = aaaa, bbbb, cccc, dddd
|h = 15000
|i = -15000
|l = 
}}

I can deal with the extraneous newlines, but I still need to make sure that '|Songs]]' and '|Shatners]]' are also matched by the regexp.

Regarding Tgr's comment below,

For my purposes, it is safe to assume that every parameter starts on a new line, where | is the first character on the line, and that no parameter definition includes a | that isn't within a [[foo|bar]] construct. So '\n|' is a safe "start" and "stop" sequence. So the question boils down to, for any given params (a,b,d,f,j,k, and m in the question), I need a regex that matches 'wanted param' in the following:

| [other param 1] = ... 
| [wanted param] = possibly multiple lines and |s that aren't after a newline
| [other param 2]

Solution

  • You can try the regex below. It matches the parameters you want to keep, not those you want to remove:

    (^{{TemplateX)|\|\s*(c|e|g|h|i|l[ ]*\=[ ]*)(.*)|(}}$)
    

    Edit

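    I refined it to the version below, which puts the parameter names in their own group (in the first version, [ ]*\=[ ]* attaches only to l). The difference is easy to see if you compare the two regexes using the diagram tool at regexper.com: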

    (^{{TemplateX)|(\|[ ]*)(c|e|g|h|i|l)([ ]*\=[ ]*)(.*)|(}}$)
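
As a rough illustration of this keep-list approach in JavaScript, the sketch below runs the refined pattern with the g and m flags and joins the matches back together. The variable names (sampleWikitext, keepPattern, keptText) are made up for this example, and the braces are escaped as a precaution. Because (.*) stops at the end of a line, it assumes every retained value fits on a single line.

```javascript
// Sample template text from the question, built line by line so that
// trailing spaces (e.g. after "|l =") are preserved.
const sampleWikitext = [
  "{{TemplateX",
  "| a =",
  "Foo bar",
  "Blah blah",
  "",
  "Fizbin foo[[domain:blah]]",
  "",
  "Ipsum lorem[[domain:blah]]",
  "|b =1",
  "|c = 0fillertext",
  "|d = 1alphabet",
  "| e =",
  "| f = 10: One Hobbit",
  "| g = aaaa, bbbb, cccc, dddd",
  "|h = 15000",
  "|i = -15000",
  "| j = Level 4 [[domain:filk|Songs]]",
  "| k =7 fizbin, 8 [[domain:trekkies|Shatners]]",
  "|l = ",
  "|m = ",
  "}}",
].join("\n");

// The refined keep-list pattern, with the braces escaped.
// g: find every match; m: let ^ and $ anchor at line boundaries.
const keepPattern =
  /(^\{\{TemplateX)|(\|[ ]*)(c|e|g|h|i|l)([ ]*=[ ]*)(.*)|(\}\}$)/gm;

// With the g flag, match() returns the full text of each match
// (or null when nothing matches, hence the fallback to an empty array).
const keptText = (sampleWikitext.match(keepPattern) || []).join("\n");

console.log(keptText);
```

A retained parameter whose value spans several lines would be cut off after its first line, so this variant only suits templates where the kept values are short.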
    

    Edit 2

    Further to the comments, the regex to match the unwanted parameters is this:

    \|[ ]?(a|b|d|f|j|k|m)([ ]*\=[ ]*)((?![\r\n]+\|)[0-9a-zA-Z, \[\]:\|\r\n\t])+
    

    Borrowing from another answer, it uses a negative lookahead to match only up to [\r\n]+\|, which in part satisfies the statement that:

    So '\n|' is a safe "start" and "stop" sequence

    I tested this after introducing a few newlines into the parameters to be retained (e.g. g).
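
For reference, here is a minimal sketch of how the Edit 2 pattern might be applied in JavaScript. The pattern is unchanged apart from the added g flag. The names templateText, unwantedPattern and strippedText are illustrative, and the second replace (collapsing leftover blank lines) is an assumption about what cleanup is acceptable, not part of the regex itself.

```javascript
// Sample template text from the question, built line by line so that
// trailing spaces are preserved.
const templateText = [
  "{{TemplateX",
  "| a =",
  "Foo bar",
  "Blah blah",
  "",
  "Fizbin foo[[domain:blah]]",
  "",
  "Ipsum lorem[[domain:blah]]",
  "|b =1",
  "|c = 0fillertext",
  "|d = 1alphabet",
  "| e =",
  "| f = 10: One Hobbit",
  "| g = aaaa, bbbb, cccc, dddd",
  "|h = 15000",
  "|i = -15000",
  "| j = Level 4 [[domain:filk|Songs]]",
  "| k =7 fizbin, 8 [[domain:trekkies|Shatners]]",
  "|l = ",
  "|m = ",
  "}}",
].join("\n");

// The pattern from Edit 2 with the g flag so every unwanted parameter is hit.
// The lookahead (?![\r\n]+\|) stops the value just before a newline + "|".
const unwantedPattern =
  /\|[ ]?(a|b|d|f|j|k|m)([ ]*\=[ ]*)((?![\r\n]+\|)[0-9a-zA-Z, \[\]:\|\r\n\t])+/g;

// Step 1: delete each unwanted parameter.
// Step 2: collapse the runs of newlines left behind.
// Assumption: blank lines inside retained values do not need to survive.
const strippedText = templateText
  .replace(unwantedPattern, "")
  .replace(/(?:\r?\n){2,}/g, "\n");

console.log(strippedText);
```

Note that the value group uses +, so an unwanted parameter with nothing at all after the = (not even a space) before the next newline and | would not be matched by this pattern.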

    One risk is that a parameter value may contain a character outside this set:

    [0-9a-zA-Z, \[\]:\|\r\n\t]
    

    In that case the match stops at the first such character, so you would need to extend the set.
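
One possible way around maintaining a character list, offered only as a sketch and not as part of the answer above, is to match the value lazily with [\s\S]*? and stop at the next newline followed by | or }}. The helper name removeParams and its arguments are hypothetical. It relies on the question's assumptions that every parameter starts with | at the beginning of a line and the closing }} sits on its own line, and it additionally assumes \n line endings.

```javascript
// Hypothetical helper: remove the named parameters from template wikitext.
// Assumes each parameter starts with "|" at the start of a line, the closing
// "}}" is on its own line, and lines end with "\n" (not "\r\n").
function removeParams(wikitext, names) {
  // Escape regex metacharacters so names are matched literally.
  const alternation = names
    .map((name) => name.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("|");

  const pattern = new RegExp(
    "\\n\\|[ \\t]*(?:" + alternation + ")[ \\t]*=" + // newline, "|", name, "="
      "[\\s\\S]*?" + // the value: any characters, as few as possible
      "(?=\\n\\||\\n\\}\\})", // stop just before the next "\n|" or "\n}}"
    "g"
  );

  // The leading newline is part of each match, so no blank lines are left.
  return wikitext.replace(pattern, "");
}

// Sample template text from the question.
const questionText = [
  "{{TemplateX",
  "| a =",
  "Foo bar",
  "Blah blah",
  "",
  "Fizbin foo[[domain:blah]]",
  "",
  "Ipsum lorem[[domain:blah]]",
  "|b =1",
  "|c = 0fillertext",
  "|d = 1alphabet",
  "| e =",
  "| f = 10: One Hobbit",
  "| g = aaaa, bbbb, cccc, dddd",
  "|h = 15000",
  "|i = -15000",
  "| j = Level 4 [[domain:filk|Songs]]",
  "| k =7 fizbin, 8 [[domain:trekkies|Shatners]]",
  "|l = ",
  "|m = ",
  "}}",
].join("\n");

const cleanedText = removeParams(questionText, ["a", "b", "d", "f", "j", "k", "m"]);
console.log(cleanedText);
```

Because the value is bounded by what follows it, not by what it contains, a | inside [[foo|bar]] is simply skipped over, and values containing characters such as - or . need no special handling.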