I'm trying to remove line breaks with Python from wikitext templates of the form:
{{cite web
|title=Testing
|url=Testing
|editor=Testing
}}
The following should be obtained with re.sub:
{{cite web|title=Testing|url=Testing|editor=Testing}}
I've been trying with Python regex for hours, yet haven't succeeded at it. For example I've tried:
while(re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}')):
textmodif=re.sub(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', r'{cite web\1\3}}', textmodif,re.DOTALL)
But it doesn't work as expected (even without the while loop, it's not working for the first line break).
I found this similar question, but it didn't help: Regex for MediaWiki wikitext templates. I'm quite new at Python so please don't be too hard on me :-)
Thank you in advance.
You need to switch on newline matching for '.'; it does not match a newline otherwise:
re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
You have multiple newlines spread throughout the text you want to match, so matching just one set of consecutive newlines is not enough.
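A likely reason the re.sub() attempt in the question had no effect, sketched on a tiny made-up string: the signature is re.sub(pattern, repl, string, count=0, flags=0), so re.DOTALL passed as the fourth positional argument ends up in count, and flags stays at 0. The sample text and variable names below are illustrative only, and the count= keyword call is used just to show what the positional call amounts to.

```python
import re

text = "a\nb"

# re.DOTALL in the fourth position is taken as count, not as a flag.
# The keyword call below is equivalent to re.sub(r'a.b', 'X', text, re.DOTALL).
as_count = re.sub(r'a.b', 'X', text, count=int(re.DOTALL))

# Passing it by keyword sets the flag as intended.
as_flags = re.sub(r'a.b', 'X', text, flags=re.DOTALL)

print(repr(as_count))  # 'a\nb' (unchanged: '.' did not match the newline)
print(repr(as_flags))  # 'X'
```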
From the re.DOTALL documentation:
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
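A minimal sketch of what that flag changes, using a shortened version of the template from the question (the variable names are illustrative):

```python
import re

text = "{{cite web\n|title=Testing\n}}"

# Without re.DOTALL, '.' stops at the first newline, so the pattern
# cannot reach the closing braces on a later line.
no_flag = re.search(r'\{cite web.*?\}\}', text)

# With flags=re.DOTALL, '.' also matches '\n', so the match spans lines.
with_flag = re.search(r'\{cite web.*?\}\}', text, flags=re.DOTALL)

print(no_flag)                   # None
print(repr(with_flag.group(0)))  # '{cite web\n|title=Testing\n}}'
```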
You could use one re.sub() call to remove all newlines within the cite stanza in one go, without a loop:
re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
This uses a nested regular expression to remove all whitespace with at least one newline in it from the matched text.
Demo:
>>> import re
>>> inputtext = '''\
... {{cite web
... |title=Testing
... |url=Testing
... |editor=Testing
... }}
... '''
>>> re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
<_sre.SRE_Match object at 0x10f335458>
>>> re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
'{{cite web|title=Testing|url=Testing|editor=Testing}}\n'
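If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name collapse_template_newlines and its name parameter are hypothetical, and it assumes templates are not nested, since the non-greedy match stops at the first closing braces it finds.

```python
import re

def collapse_template_newlines(text, name="cite web"):
    """Remove line breaks (and whitespace around them) inside {{name ...}}."""
    # Match from '{{name' to the first '}}'; DOTALL lets '.' cross newlines.
    template = re.compile(r'\{\{' + re.escape(name) + r'.*?\}\}', flags=re.DOTALL)
    # Any run of whitespace that contains at least one newline.
    newline_run = re.compile(r'\s*[\r\n]\s*')
    # Only the matched template is rewritten; text outside it is left alone.
    return template.sub(lambda m: newline_run.sub('', m.group(0)), text)

sample = (
    "Intro line\n"
    "{{cite web\n"
    "|title=Testing\n"
    "|url=Testing\n"
    "|editor=Testing\n"
    "}}\n"
    "Outro line\n"
)
print(collapse_template_newlines(sample))
```

Line breaks outside the template (here, around "Intro line" and "Outro line") are kept, because only the matched text is passed to the inner substitution.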