python regex lookbehind lookahead


I posted a question a few days ago about how to capture the words in a text that precede a certain regex match.

Using the solutions proposed there, I have been experimenting in regex101 to also capture the words that FOLLOW the match.

This is the code:

import re

content = """Lorem ipsum dolor sit amet (12,16) , consectetur 23 adipiscing elit. Curabitur (45) euismod scelerisque consectetur. Vivamus aliquam velit (46,48,49) at augue faucibus, id eleifend purus (34) egestas. Aliquam vitae mauris cursus, facilisis enim (23) condimentum, vestibulum enim. """

print(content)
pattern = re.compile(r"((?:\w+ ?){1,5}(?=\(\d))(\([\d]+\))(?: )(?:\w+ ?){1,5}")
matches = pattern.findall(content)
print('the matches are:')
print(matches)

The regex works and catches a number between parentheses.

This is my breakdown of the regex:

((?:\w+ ?){1,5}(?=\(\d))(\([\d]+\))(?: )(?:\w+ ?){1,5}
________________________***********+++++++++++++++++++

____ = the part that looks behind the match: it captures 1 to 5 words and uses a lookahead to stop right before an opening ( followed by a digit

**** = the actual match ===> a number between parentheses

++++ = the part I intend to use to catch the words AFTER the match.

I tried it in regex101, where the result looked right.

But the code produces the following:

[('Curabitur ', '(45)'), ('id eleifend purus ', '(34)'), ('facilisis enim ', '(23)')]

As you can see, the list contains tuples with the preceding words first and then the match itself, BUT NOT THE FOLLOWING WORDS.

Where is the catch?

My expected result would be:

matches = [('Curabitur ', '(45)', 'euismod scelerisque consectetur'), ('id eleifend purus ', '(34)', 'egestas'), ('facilisis enim ', '(23)', 'condimentum')]


Solution

  • When a pattern contains capturing groups, findall returns only the text of those groups, so your regex needs a 3rd capturing group around the trailing words for them to be returned:

    >>> print(re.findall(r"((?:\w+ ?){1,5}(?=\(\d))(\(\d+\))(?: )((?:\w+ ?){1,5})", content))
    [('Curabitur ', '(45)', 'euismod scelerisque consectetur'), ('id eleifend purus ', '(34)', 'egestas'), ('facilisis enim ', '(23)', 'condimentum')]
    

    Note ((?:\w+ ?){1,5}) as the 3rd capturing group.

    Also note that [\d]+ is equivalent to \d+.
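
To make the difference concrete, here is a minimal sketch that runs both patterns on the text from the question and then shows one possible alternative using finditer with named groups. The names two_groups, three_groups, named, before, number, after and rows are illustrative choices made for this example, not part of the original code.

```python
import re

content = (
    "Lorem ipsum dolor sit amet (12,16) , consectetur 23 adipiscing elit. "
    "Curabitur (45) euismod scelerisque consectetur. "
    "Vivamus aliquam velit (46,48,49) at augue faucibus, "
    "id eleifend purus (34) egestas. "
    "Aliquam vitae mauris cursus, facilisis enim (23) condimentum, vestibulum enim. "
)

# Original pattern: the trailing words sit in a NON-capturing group,
# so they are matched but findall does not return them.
two_groups = re.compile(
    r"((?:\w+ ?){1,5}(?=\(\d))(\(\d+\))(?: )(?:\w+ ?){1,5}"
)

# Fixed pattern: the trailing words are wrapped in a 3rd capturing group.
three_groups = re.compile(
    r"((?:\w+ ?){1,5}(?=\(\d))(\(\d+\))(?: )((?:\w+ ?){1,5})"
)

print(two_groups.findall(content)[0])
# ('Curabitur ', '(45)')
print(three_groups.findall(content)[0])
# ('Curabitur ', '(45)', 'euismod scelerisque consectetur')

# Alternative sketch: finditer yields match objects, which expose the
# whole match via group(0) and each named group via groupdict().
named = re.compile(
    r"(?P<before>(?:\w+ ?){1,5}(?=\(\d))"
    r"(?P<number>\(\d+\)) "
    r"(?P<after>(?:\w+ ?){1,5})"
)

rows = [m.groupdict() for m in named.finditer(content)]
for row in rows:
    print(row["before"].strip(), "|", row["number"], "|", row["after"])
```

With finditer the whole match stays available through group(0) regardless of how many groups the pattern has, which can be easier to debug than the group-only tuples that findall returns.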