python regex python-re regex-greedy

Python Re - Named Capture Group Too Greedy


I would like to pick up "Bar" from the following strings:

FooFooFoo the FooFoo the Bar Foo
FooFooFoo the FooFoo my Bar Foo

But the regex I wrote, (the|my) (?P<bar>.+?) Foo, seems to be too greedy and captures more text than required (example at regex101.com).

Edit: "Bar" is just an example of a string to match. In my real case it could be made up of multiple words.

What am I doing wrong? Thanks!

I need to run this with Python's standard re library.


Solution

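  • Your main issue is that the regex engine searches for matches from left to right. Once the first my or the is found, .+? matches as few chars other than line break chars as possible, but as many as necessary to complete a valid match. The capture therefore starts right after the first the or my in the string, not the last one, and runs up to the text before Foo.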

    You need to match all text (using .*?) up to the last word (that can be matched with a \w+ pattern) before Foo:

    (the|my) .*?(?P<bar>\w+) Foo
    

    See the regex demo. Another variation is to match the or my as whole words and match any text up to the closest non-whitespace char chunk before Foo:

    \b(the|my)\b.*?(?P<bar>\S+)\s+Foo
    

    See this regex demo. Details:

    • \b(the|my)\b - either the or my, matched as a whole word
    • .*? - any zero or more chars other than line break chars, as few as possible
    • (?P<bar>\S+) - Group "bar": one or more non-whitespace chars
    • \s+ - one or more whitespace chars
    • Foo - a Foo string.
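
For reference, here is a minimal sketch that runs the pattern from the question and both suggested patterns against the two sample strings, using only the standard re module. The helper name extract_bar and the constant names are illustrative assumptions, not something from the question.

```python
import re

SAMPLES = [
    "FooFooFoo the FooFoo the Bar Foo",
    "FooFooFoo the FooFoo my Bar Foo",
]

# Pattern from the question: the match starts at the FIRST "the"/"my".
ORIGINAL = re.compile(r"(the|my) (?P<bar>.+?) Foo")
# First suggestion: skip text lazily, then capture the last \w+ word before " Foo".
LAST_WORD = re.compile(r"(the|my) .*?(?P<bar>\w+) Foo")
# Second suggestion: whole-word the/my, then the last non-whitespace chunk before Foo.
LAST_CHUNK = re.compile(r"\b(the|my)\b.*?(?P<bar>\S+)\s+Foo")


def extract_bar(pattern, text):
    """Return the "bar" group of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group("bar") if match else None


for sample in SAMPLES:
    print(extract_bar(ORIGINAL, sample))    # "FooFoo the Bar" / "FooFoo my Bar"
    print(extract_bar(LAST_WORD, sample))   # "Bar"
    print(extract_bar(LAST_CHUNK, sample))  # "Bar"
```

Note that both suggested patterns capture only the single word (or non-whitespace chunk) directly before Foo, so a value made up of several words would not be captured in full by either of them.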