python, replace, ref

replace some part of a word with regex


How do you delete the text inside <ref>*some text*</ref> together with the ref tags themselves?

For example, in '...and so on<ref>Oxford University Press</ref>.'

re.sub(r'<ref>.+</ref>', '', string) only removes <ref> if <ref> is followed by whitespace.

EDIT: I guess it has something to do with word boundaries, or does it?

EDIT2: What I need is for it to match the last (closing) </ref> even if it is on a new line.


Solution

  • I don't really see your problem, because the code you pasted does remove the <ref>...</ref> part of the string. But if what you mean is that an empty ref tag is not removed:

    re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')
    

    Then what you need to do is replace the .+ with .*

    A + means one or more, while * means zero or more.

    From http://docs.python.org/library/re.html:

    '.' (Dot.) In the default mode, this matches any character except a newline.
        If the DOTALL flag has been specified, this matches any character including
        a newline.
    '*' Causes the resulting RE to match 0 or more repetitions of the preceding
        RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
        followed by any number of ‘b’s.
    '+' Causes the resulting RE to match 1 or more repetitions of the preceding
        RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
        not match just ‘a’.
    '?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
        ab? will match either ‘a’ or ‘ab’.
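
As a rough illustration of the points above, the sketch below compares .+ with .* on an empty tag, and then shows the DOTALL flag mentioned in the quoted documentation for the case where the closing </ref> sits on a new line. The sample strings and variable names are made up for the example; only re.sub and re.DOTALL come from the standard library.

```python
import re

# Sample inputs (illustrative only, based on the string in the question).
text = "...and so on<ref>Oxford University Press</ref>."
empty = "...and so on<ref></ref>."
multiline = "...and so on<ref>Oxford\nUniversity Press\n</ref>."

# '+' needs at least one character between the tags,
# so the filled tag is removed but the empty tag is left alone.
plus_result = re.sub(r"<ref>.+</ref>", "", text)
plus_on_empty = re.sub(r"<ref>.+</ref>", "", empty)

# '*' also accepts zero characters, so the empty tag is removed too.
star_on_empty = re.sub(r"<ref>.*</ref>", "", empty)

# By default '.' does not match a newline, so nothing is removed here.
no_dotall = re.sub(r"<ref>.*</ref>", "", multiline)

# With re.DOTALL, '.' matches newlines as well, so the whole tag goes.
with_dotall = re.sub(r"<ref>.*</ref>", "", multiline, flags=re.DOTALL)

print(plus_result)
print(repr(plus_on_empty))
print(star_on_empty)
print(repr(no_dotall))
print(with_dotall)
```

Note that .* is greedy, so the match runs from the first <ref> to the last </ref> in the string. That is what EDIT2 asks for, but if the text contained several separate ref tags, everything between them would be removed as well.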