python regex regex-lookarounds lookbehind

Python regex lookbehind inconsistency?


When I run the following code:

import re

s = 'baaaad'

l = re.findall(r'((a)(?=a))', s)
print(l)
for elem in l:
    print(''.join(elem))

I get the output:

[('a', 'a'), ('a', 'a'), ('a', 'a')]
aa
aa
aa

which is as expected. But when I try the corresponding strategy with a lookbehind, i.e.:

s = 'baaaad'

l = re.findall(r'((?<=b)(a))', s)
print(l)
for elem in l:
    print(''.join(elem))

I get:

[('a', 'a')]
aa

I was expecting to get:

[('b', 'a')]
ba

Why do I get this (to me) unexpected behavior? If I am doing something wrong, what is it, and how do I fix it?

Thanks!


Solution

  • You seem to think that one of the groups in the output is from (a), and the other is from the lookahead or lookbehind. That's not the case. One of the groups is (a), and the other is from the parentheses surrounding your entire regex:

     v    v not these
    ((?<=b)(a))
    ^         ^ these
    

    The lookahead does not match a, and the lookbehind does not match b. Each matches a position in the string: one after which an a occurs, or one before which a b occurs. They are zero-width assertions, so they consume no characters. Thus both of your regexes match only a, with restrictions on what may come before or after it, and both capturing groups in both regexes capture only a.
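The question also asks how to fix it, so here is a minimal sketch of one option, assuming the goal is simply to get the b into the result: put it in a capturing group of its own instead of a lookbehind. The names m, empty and pairs are only illustrative. The first half checks the explanation above against the original pattern.

```python
import re

s = 'baaaad'

# Original lookbehind pattern: the match is the single 'a' at index 1,
# and both capturing groups hold that same 'a'.
m = re.search(r'((?<=b)(a))', s)
print(m.group(0), m.span(0))   # a (1, 2)
print(m.group(1), m.group(2))  # a a

# The lookbehind on its own matches a position, i.e. an empty string.
empty = re.findall(r'(?<=b)', s)
print(empty)                   # ['']

# One possible fix: capture the 'b' itself rather than asserting it.
pairs = re.findall(r'(b)(a)', s)
print(pairs)                   # [('b', 'a')]
for elem in pairs:
    print(''.join(elem))       # ba
```

Note that with (b)(a) the b becomes part of the match and is consumed, which is harmless here but matters if you need overlapping matches.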