Say I have a regex r"(([a-zA-Z]+)(&|\|)([a-zA-Z]+))" and a string "groupone|grouptwo|groupthree|groupfour".
If I run
re.findall(r"(([a-zA-Z]+)(&|\|)([a-zA-Z]+))", "groupone|grouptwo|groupthree|groupfour")
it returns:
[('groupone|grouptwo', 'groupone', '|', 'grouptwo'), ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]
This is not my desired result. I would also like the overlapping pair grouptwo|groupthree to be matched, like this:
[('groupone|grouptwo', 'groupone', '|', 'grouptwo'), ('grouptwo|groupthree', 'grouptwo', '|', 'groupthree'), ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]
What do I need to correct about my regex to make this possible?
You could use the third-party regex module for this. Unlike the standard library re, it supports overlapping matches.
import regex
regex.findall(r"(\b([a-zA-Z]+\b)(&|\|)(\b[a-zA-Z]+)\b)", "groupone|grouptwo|groupthree|groupfour", overlapped=True)
[('groupone|grouptwo', 'groupone', '|', 'grouptwo'),
('grouptwo|groupthree', 'grouptwo', '|', 'groupthree'),
('groupthree|groupfour', 'groupthree', '|', 'groupfour')]
N.B. Note the addition of word boundaries (\b) in the pattern. If you kept your original pattern, this method would also return a bunch of unwanted matches, namely ones that start partway through a word.
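If installing a third-party module is not an option, a similar result can probably be had with the standard library re by wrapping the whole pattern in a lookahead. This is a minimal sketch of that alternative technique, not the regex module's overlapped=True mechanism; the variable names (text, bounded, unbounded, pairs, noisy) are only illustrative. It also shows why the word boundaries matter.

```python
import re

text = "groupone|grouptwo|groupthree|groupfour"

# Wrapping the pattern in a lookahead (?=...) makes every match zero-width,
# so the scanner advances one character at a time and pairs may overlap.
# The capture groups inside the lookahead are still returned by findall.
bounded = re.compile(r"(?=(\b([a-zA-Z]+)(&|\|)([a-zA-Z]+)\b))")
pairs = bounded.findall(text)
for pair in pairs:
    print(pair)

# Same idea without \b: a match may now begin in the middle of a word,
# which produces many unwanted results such as 'roupone|grouptwo'.
unbounded = re.compile(r"(?=(([a-zA-Z]+)(&|\|)([a-zA-Z]+)))")
noisy = unbounded.findall(text)
print(len(pairs), "matches with \\b,", len(noisy), "without")
```

With the boundaries in place, only the three word-aligned pairs should come back; without them, every letter position that is followed by a separator yields its own match.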