python re.findall and re.sub


My code:

import re
print(re.findall(r'(?=(Deportivo))(?!.*\bla\b)','Deportivo coruna'))
print(re.sub(r'(?=(Deportivo))(?!.*\bla\b)','','Deportivo coruna'))

I want to remove 'Deportivo' only if the string does not contain the word 'la'.

For instance,

re.findall(r'(?=(Deportivo))(?!.*\bla\b)','Deportivo coruna')

returns ['Deportivo'] and

re.findall(r'(?=(Deportivo))(?!.*\bla\b)','Deportivo la coruna')

returns []

However,

re.sub(r'(?=(Deportivo))(?!.*\bla\b)','','Deportivo coruna')

returns 'Deportivo coruna'; the string is unchanged. I don't understand why. Please help.


Solution

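  • findall and sub work differently here. According to the docs, re.findall() returns the contents of the capturing groups when the pattern has any, even if the overall match is the empty string (which it is in your case, since the regex consists entirely of lookahead assertions). re.sub(), on the other hand, replaces only the text of the overall match. That match is empty, so replacing it with '' leaves the string unchanged.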

    So if you want to remove Deportivo from your text if and only if it doesn't also contain la, you could use

    re.sub(r'^(?!.*\bla\b)(.*?)Deportivo', r'\1', 'Deportivo coruna')
    

    However, that will only remove the first occurrence, and it's not trivial to change that because you would need unlimited repetition in lookbehind assertions, which Python's re module doesn't support (lookbehinds must be fixed-width). For the record,

    re.sub(r'(?<!\bla\b.*)Deportivo(?!.*\bla\b)','','Deportivo coruna')
    

    would do the trick, but that regex won't compile in Python.

    So your best bet probably is to do it in two steps. First, check that your string doesn't contain la. Then replace all Deportivos with the empty string.
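
The two-step approach could look roughly like the sketch below. The function name remove_unless_present and its parameters are made up for illustration, and it assumes both words should be treated as literal text, with 'la' matched only as a whole word, as in the question.

```python
import re

def remove_unless_present(text, target="Deportivo", blocker="la"):
    """Remove every `target` from `text` unless `blocker` occurs as a whole word.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Step 1: leave the text alone if the blocking word is present.
    if re.search(r"\b" + re.escape(blocker) + r"\b", text):
        return text
    # Step 2: otherwise delete every occurrence of the target.
    return re.sub(re.escape(target), "", text)

print(repr(remove_unless_present("Deportivo coruna")))            # ' coruna'
print(repr(remove_unless_present("Deportivo la coruna")))         # 'Deportivo la coruna'
print(repr(remove_unless_present("Deportivo coruna Deportivo")))  # ' coruna '
```

Note that the surrounding whitespace is left in place; call .strip() on the result, or collapse repeated spaces, if that matters for your data.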