Python regex on wikitext template


I'm trying to remove line breaks with Python from wikitext templates of the form:

{{cite web
|title=Testing
|url=Testing
|editor=Testing
}}

The following should be obtained with re.sub:

{{cite web|title=Testing|url=Testing|editor=Testing}}

I've been trying with Python regex for hours, yet haven't succeeded at it. For example I've tried:

while(re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}')):
     textmodif=re.sub(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', r'{cite web\1\3}}', textmodif,re.DOTALL)

But it doesn't work as expected (even without the while loop, it's not working for the first line break).

I found this similar question, but it didn't help: Regex for MediaWiki wikitext templates. I'm quite new to Python, so please don't be too hard on me :-)

Thank you in advance.


Solution

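  • You need to switch on newline matching for .; it does not match a newline otherwise. Pass re.DOTALL through the flags keyword: the fourth positional argument of re.sub() is count, not flags, so your call treats re.DOTALL as a replacement limit and never enables the flag. Your re.search() call is also missing the string to search: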

    re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
    

    You have multiple newlines spread throughout the text you want to match, so matching just one set of consecutive newlines is not enough.

    From the re.DOTALL documentation:

    Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

    You could use one re.sub() call to remove all newlines within the cite stanza in one go, without a loop:

    re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
    

    The replacement function runs a second, nested re.sub() over each matched template and removes every run of whitespace that contains at least one newline.

    Demo:

    >>> import re
    >>> inputtext = '''\
    ... {{cite web
    ... |title=Testing
    ... |url=Testing
    ... |editor=Testing
    ... }}
    ... '''
    >>> re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
    <_sre.SRE_Match object at 0x10f335458>
    >>> re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
    '{{cite web|title=Testing|url=Testing|editor=Testing}}\n'
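If you need this in more than one place, the same approach can be wrapped in a small helper. The following is only a sketch: the names CITE_WEB, LINE_BREAK and collapse_cite_web are made up for illustration, the pattern is slightly simplified (it also matches templates that are already on one line, which are left unchanged), and it assumes the template contains no nested {{...}} templates.

```python
import re

# Matches one {{cite web ...}} template, possibly spanning several lines.
# Assumption: no nested {{...}} inside the template, because the
# non-greedy .*? stops at the first closing braces it finds.
CITE_WEB = re.compile(r'\{\{cite web.*?\}\}', flags=re.DOTALL)

# A run of whitespace that contains at least one line break.
LINE_BREAK = re.compile(r'\s*[\r\n]\s*')


def collapse_cite_web(text):
    """Remove line breaks inside every {{cite web}} template in text."""
    # The outer substitution finds each template; the inner one strips
    # the line breaks from that template only.
    return CITE_WEB.sub(lambda m: LINE_BREAK.sub('', m.group(0)), text)


sample = (
    "Intro line\n"
    "{{cite web\n"
    "|title=Testing\n"
    "|url=Testing\n"
    "|editor=Testing\n"
    "}}\n"
    "Outro line\n"
)
result = collapse_cite_web(sample)
print(result)
```

Line breaks outside the template are kept, so the surrounding text stays on its own lines. Compiling the patterns once also means the flag is attached to the pattern itself, so it cannot end up in the count position by accident.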