I'm trying to remove line breaks with Python from wikitext templates of the form:
{{cite web
|title=Testing
|url=Testing
|editor=Testing
}}
The following should be obtained with re.sub:
{{cite web|title=Testing|url=Testing|editor=Testing}}
I've been trying with Python regex for hours, yet haven't succeeded at it. For example I've tried:
while(re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}')):
    textmodif=re.sub(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', r'{cite web\1\3}}', textmodif,re.DOTALL)
But it doesn't work as expected (even without the while loop, it's not working for the first line break).
I found this similar question, but it didn't help: Regex for MediaWiki wikitext templates. I'm quite new to Python, so please don't be too hard on me :-)
Thank you in advance.
You need to switch on newline matching for '.'; it does not match a newline otherwise:
re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
You have multiple newlines spread throughout the text you want to match, so matching just one set of consecutive newlines is not enough.
From the re.DOTALL documentation:
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
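To see the effect of the flag, here is a small sketch that runs the question's pattern on a shortened two-line template (the input and the variable names are illustrative, not from the question). It also shows a likely reason the re.sub() attempt in the question had no effect: the fourth positional parameter of re.sub() is count, so re.DOTALL passed there is read as a count of 16 rather than as a flag.

```python
import re

# Shortened, illustrative input: two line breaks inside one template.
text = "{{cite web\n|title=Testing\n}}"
pattern = r'\{cite web(.*?)([\r\n]+)(.*?)\}\}'

# Without DOTALL, '.' cannot cross the second newline, so nothing matches.
plain = re.search(pattern, text)

# With DOTALL, '.' also matches newlines and the template is found.
dotall = re.search(pattern, text, flags=re.DOTALL)

# Passing re.DOTALL as the fourth positional argument of re.sub() is the
# same as this call: it sets count, not flags, so the text stays unchanged.
miscounted = re.sub(pattern, r'{cite web\1\3}}', text, count=int(re.DOTALL))

print(plain)                # None
print(dotall is not None)   # True
print(miscounted == text)   # True
```

Passing the flag by keyword, as in flags=re.DOTALL, avoids the mix-up.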
You could use one re.sub() call to remove all newlines within the cite stanza in one go, without a loop:
re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
This uses a nested regular expression to remove all whitespace with at least one newline in it from the matched text.
>>> import re
>>> inputtext = '''\
... {{cite web
... |title=Testing
... |url=Testing
... |editor=Testing
... }}
... '''
>>> re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
<_sre.SRE_Match object at 0x10f335458>
>>> re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
'{{cite web|title=Testing|url=Testing|editor=Testing}}\n'
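If the substitution is needed in more than one place, it could be wrapped in a small helper. This is a minimal sketch that reuses the two patterns from the answer; the names CITE_WEB, NEWLINE_RUN and collapse_cite_web are made up for illustration, and the sample text is invented.

```python
import re

# Hypothetical names; the patterns are the ones used in the answer.
CITE_WEB = re.compile(r'\{cite web.*?[\r\n]+.*?\}\}', flags=re.DOTALL)
NEWLINE_RUN = re.compile(r'\s*[\r\n]\s*')


def collapse_cite_web(wikitext):
    """Strip whitespace runs containing a newline inside cite web templates."""
    # The outer pattern finds each multi-line template; the inner pattern
    # then removes the line breaks from the matched text only.
    return CITE_WEB.sub(lambda m: NEWLINE_RUN.sub('', m.group(0)), wikitext)


sample = "Intro line.\n{{cite web\n|title=Testing\n|url=Testing\n}}\nOutro line.\n"
result = collapse_cite_web(sample)
print(repr(result))
# 'Intro line.\n{{cite web|title=Testing|url=Testing}}\nOutro line.\n'
```

Line breaks outside the template are left alone. Because the outer pattern spans newlines, a cite web template that is already on one line can be matched together with text up to a later closing }}, so this is a starting point rather than a full wikitext parser.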