
How can you get overlapping matches in regex?


Say I have a regex r"(([a-zA-Z]+)(&|\|)([a-zA-Z]+))", and a string "groupone|grouptwo|groupthree|groupfour".

If I run

re.findall(r"(([a-zA-Z]+)(&|\|)([a-zA-Z]+))", "groupone|grouptwo|groupthree|groupfour")

it returns:

[('groupone|grouptwo', 'groupone', '|', 'grouptwo'), ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]

This is not the result I want. I would also like the overlapping pair grouptwo|groupthree to be matched, like this:

[('groupone|grouptwo', 'groupone', '|', 'grouptwo'), ('grouptwo|groupthree', 'grouptwo', '|', 'groupthree'), ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]

What do I need to correct about my regex to make this possible?


Solution

  • You could use the third-party regex module for this. Unlike the standard library re, it supports overlapping matches through the overlapped=True argument.

    import regex
    
    regex.findall(r"(\b([a-zA-Z]+\b)(&|\|)(\b[a-zA-Z]+)\b)", "groupone|grouptwo|groupthree|groupfour", overlapped=True)
    
    [('groupone|grouptwo', 'groupone', '|', 'grouptwo'),
     ('grouptwo|groupthree', 'grouptwo', '|', 'groupthree'),
     ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]
    

    N.B. Note the addition of word boundaries (\b) in the pattern. With overlapped=True the search restarts one character after the start of the previous match, so your original pattern would also return unwanted matches that begin in the middle of a word, such as 'roupone|grouptwo'.
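
If installing a third-party module is not an option, a common alternative (not part of the answer above) is to wrap the whole pattern in a lookahead, which the standard library re does support. The following is a minimal sketch under that assumption; the variable names text, pattern, matches and noisy are illustrative only.

```python
import re

text = "groupone|grouptwo|groupthree|groupfour"

# Wrapping the pattern in a lookahead (?=...) makes every match zero-width,
# so the scanner advances one character at a time and a later match may
# reuse text that an earlier match already captured.
# The leading \b only allows a match to start at a word boundary.
pattern = re.compile(r"\b(?=(([a-zA-Z]+)(&|\|)([a-zA-Z]+)))")

matches = pattern.findall(text)
for m in matches:
    print(m)
# ('groupone|grouptwo', 'groupone', '|', 'grouptwo')
# ('grouptwo|groupthree', 'grouptwo', '|', 'groupthree')
# ('groupthree|groupfour', 'groupthree', '|', 'groupfour')

# Without \b, matches may also start in the middle of a word.
noisy = re.findall(r"(?=(([a-zA-Z]+)(&|\|)([a-zA-Z]+)))", text)
print(noisy[1])
# ('roupone|grouptwo', 'roupone', '|', 'grouptwo')
```

Because the lookahead itself consumes nothing, the matched text is only available through the capture groups, which is why the outermost group is kept around the whole expression.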