About re.findall

Below is my Python code:

import re

msg = '''[email protected] [email protected]'''
pattern = r'''(
        [a-zA-Z0-9_.]+     
        @                           
        [a-zA-Z0-9-.]+      
        \.                           
        [a-zA-Z]{2,4}       
        (\.)?                      
        ([a-zA-Z]{2,4})?  
        )'''
email = re.findall(pattern, msg, re.VERBOSE)
print(email)

I ran it in the Python shell and got the result below:

[('[email protected]', '', ''), ('[email protected]', '', '')]

My question is: why are the 2nd and 3rd elements of the 1st tuple empty? I expected them to be "." and "tw".

Am I misunderstanding something?

Solution

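The character class after the @ ([a-zA-Z0-9-.]+) includes a literal ., and + is greedy: it matches as much as it can instead of stopping as soon as it can. It therefore swallows the whole domain, and the engine backtracks only far enough for \.[a-zA-Z]{2,4} to match the final ".tw". By then nothing is left for the two optional groups, so they match nothing and findall reports them as empty strings.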

You can avoid this either by making the class non-greedy ([...]+?) or by removing the dot from it, which leaves the trailing "." and "tw" for the rest of the regexp to match.

Code:

>>> import re
>>> msg = '''[email protected] [email protected]'''
>>> pattern2 = r'''(
...         [a-zA-Z0-9_.]+
...         @
...         [a-zA-Z0-9-]+
...         \.
...         [a-zA-Z]{2,4}
...         (\.)?
...         ([a-zA-Z]{2,4})?
...         )'''
>>> re.findall(pattern2, msg, re.VERBOSE)
[('[email protected]', '.', 'tw'), ('[email protected]', '', '')]
>>> pattern3 = r'''(
...         [a-zA-Z0-9_.]+
...         @
...         [a-zA-Z0-9-.]+?
...         \.
...         [a-zA-Z]{2,4}
...         (\.)?
...         ([a-zA-Z]{2,4})?
...         )'''
>>> re.findall(pattern3, msg, re.VERBOSE)
[('[email protected]', '.', 'tw'), ('[email protected]', '', '')]
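To see what the greedy class actually consumes, it can help to capture the domain part in its own group. The snippet below is only a sketch: the address user@example.com.tw and the names tail, greedy, no_dot and lazy are made up for illustration (the addresses in the question are obscured), and the patterns are trimmed-down versions of the ones above rather than the original code.

```python
import re

# Hypothetical address for illustration; not one of the addresses in the question.
msg = 'user@example.com.tw'

# Shared tail: a dot, a 2-4 letter label, then the two optional groups.
tail = r'\.([a-zA-Z]{2,4})(\.)?([a-zA-Z]{2,4})?'

greedy = r'@([a-zA-Z0-9.-]+)' + tail    # dot in the class, greedy
no_dot = r'@([a-zA-Z0-9-]+)' + tail     # dot removed from the class
lazy = r'@([a-zA-Z0-9.-]+?)' + tail     # dot in the class, non-greedy

greedy_groups = re.findall(greedy, msg)
no_dot_groups = re.findall(no_dot, msg)
lazy_groups = re.findall(lazy, msg)

print(greedy_groups)  # [('example.com', 'tw', '', '')]
print(no_dot_groups)  # [('example', 'com', '.', 'tw')]
print(lazy_groups)    # [('example', 'com', '.', 'tw')]
```

In the greedy case the first group has already taken "example.com", so "tw" is used up by the mandatory label and both optional groups come back as empty strings.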