Search code examples
pythonregexnegative-lookbehind

Could you explain why this regex is not working?


>>> d = "Batman,Superman"
>>> m = re.search("(?<!Bat)\w+",d)
>>> m.group(0)
'Batman'

Why isn't group(0) matching Superman? This lookaround tutorial says:

(?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind


Solution

  • At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if true, the engine continues trying to make a match. As soon as the engine can match your pattern, it'll return a match.

    At position 0 of your string (ie. prior to the B in Batman), the assertion succeeded, as Bat is not present before the current position - thus, \w+ can match the entire word Batman (remember, regexes are inherently greedy - ie. will match as much as possible).

    See this page for more information on engine internals.


    To achieve what you wanted, you could instead use something like:

    \b(?!Bat)\w+
    

    In this pattern, the engine will match a word boundary (\b)1, followed by one or more word characters, with the assertion that the word characters do not start with Bat. A lookahead is used rather than a lookbehind because using a lookbehind here would have the same problem as your original pattern; it would look before the position directly following the word boundary, and since its already been determined that the position before the cursor is a word boundary, the negative lookbehind would always succeed.

    1 Note that word boundaries match a boundary between \w and \W (ie. between [A-Za-z0-9_] and any other character; it also matches the ^ and $ anchors). If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.