How do you delete the text inside <ref> *some text*</ref> together with the ref tags themselves?
For example, in '...and so on<ref>Oxford University Press</ref>.'
re.sub(r'<ref>.+</ref>', '', string)
only removes <ref> if <ref> is followed by whitespace.
EDIT: It has something to do with word boundaries, I guess... or does it?
EDIT2: What I need is for it to match the last (closing) </ref> even if it is on a new line.
I don't really see your problem, because the code you pasted will remove the <ref>...</ref>
part of the string. But if what you mean is that an empty ref tag is not removed:
re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')
Then what you need to do is replace the .+ with .*
A + means one or more, while * means zero or more.
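As a quick check of the difference, here is a minimal sketch that runs both patterns against the empty-tag string from above. The variable names are only for illustration.

```python
import re

text = '...and so on<ref></ref>.'

# .+ needs at least one character between the tags, so the empty tag is left alone.
with_plus = re.sub(r'<ref>.+</ref>', '', text)

# .* also accepts zero characters, so the empty tag is removed.
with_star = re.sub(r'<ref>.*</ref>', '', text)

print(with_plus)
print(with_star)
```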
From http://docs.python.org/library/re.html:
'.' (Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character including
a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
ab? will match either ‘a’ or ‘ab’.
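For the case in EDIT2, where the closing </ref> sits on a new line, the DOTALL flag quoted above is probably what is needed. The following is a sketch only; the multi-line input string is made up for illustration and is not from the question.

```python
import re

# Hypothetical input: the closing tag is on its own line.
text = '...and so on<ref>Oxford University\nPress\n</ref>.'

# Default mode: '.' does not match a newline, so the pattern finds nothing.
default_result = re.sub(r'<ref>.*</ref>', '', text)

# DOTALL: '.' also matches newlines, so the whole tag is removed.
dotall_result = re.sub(r'<ref>.*</ref>', '', text, flags=re.DOTALL)

print(repr(default_result))
print(repr(dotall_result))
```

Because * takes as many repetitions as possible, the match runs up to the last </ref> in the string. That is what EDIT2 asks for, but if the string holds several separate <ref> tags, the text between them would be removed as well.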