Capture the same match into different groups

I have a large list of keywords and the groups they belong to. Some keywords belong to more than one group. For each keyword found in the text, I need to find every named group it belongs to.

import re

msg = 'ef'
tmp = re.compile(r'(?P<ABC>ab|ef)|(?P<EFG>cd|ef)')
res = [g.groupdict() for g in tmp.finditer(msg)]

[{'ABC': 'ef', 'EFG': None}]

When no keyword is shared between groups, every group is found as expected:

msg = 'cd ab cd'

[{'ABC': None, 'EFG': 'cd'}, {'ABC': 'ab', 'EFG': None}, {'ABC': None, 'EFG': 'cd'}]

With "ef", however, only "ABC" is reported and "EFG" stays None. How do I find all the groups ("ABC", "EFG") when "ef" appears in the text?

Solution

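A plain alternation stops at the first branch that matches, so 'ef' only ever ends up in ABC. Lookaheads do not consume any text, so you can use them to capture the same text into several groups: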

(?=(?P<ABC>ab|ef)?)   # Lookahead and capture 'ab' or 'ef' into group ABC
                      # if either of them is there
(?=(?P<EFG>cd|ef)?)   # then do the same with 'cd', 'ef' and group EFG,
(?:ab|cd|ef)          # before matching the real thing: a union set of all keywords.

Try it on regex101.com (see the Match information panel).
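
As a quick local check, here is a minimal sketch of the same pattern compiled in Python with re.VERBOSE and run against the 'ef' input from the question. The variable names are only illustrative.

```python
import re

# Same pattern as above; re.VERBOSE keeps the layout and comments legal.
pattern = re.compile(
    r'''
    (?=(?P<ABC>ab|ef)?)   # optionally capture 'ab' or 'ef' into ABC
    (?=(?P<EFG>cd|ef)?)   # optionally capture 'cd' or 'ef' into EFG
    (?:ab|cd|ef)          # consume whichever keyword is actually there
    ''',
    re.VERBOSE,
)

result = [m.groupdict() for m in pattern.finditer('ef')]
print(result)  # [{'ABC': 'ef', 'EFG': 'ef'}]
```

Each lookahead is wrapped in an optional group, so a group that does not apply is simply left as None instead of failing the whole match.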

Given that, writing a script to generate such a regex should be trivial:

from itertools import chain
import re


keywords_by_group = {
  'ABC': ('ab', 'ef'),
  'EFG': ('cd', 'ef'),
  'HIJ': ('ef', 'hi', 'jk')
}

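# One optional capturing lookahead per group; keywords are escaped
# so that regex metacharacters and spaces are matched literally.
lookaheads = (
  f'''(?=(?P<{group}>{'|'.join(map(re.escape, keywords))})?)'''
  for group, keywords in keywords_by_group.items()
)

# Deduplicate the keywords and put longer ones first, which gives a
# stable order and keeps a keyword from being shadowed by its own prefix.
all_keywords = sorted(
  set(chain(*keywords_by_group.values())),
  key=lambda keyword: (-len(keyword), keyword)
)

regex = re.compile(
  fr'''
  {''.join(lookaheads)}
  (?:{'|'.join(map(re.escape, all_keywords))})
  ''',
  re.X
)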

print(regex.pattern)

'''
(?=(?P<ABC>ab|ef)?)(?=(?P<EFG>cd|ef)?)(?=(?P<HIJ>ef|hi|jk)?)
(?:ab|cd|ef|hi|jk)
'''

Try it:

message = 'cd ab cd ef'
print([match.groupdict() for match in regex.finditer(message)])

'''
[
  {'ABC': None, 'EFG': 'cd', 'HIJ': None},
  {'ABC': 'ab', 'EFG': None, 'HIJ': None},
  {'ABC': None, 'EFG': 'cd', 'HIJ': None},
  {'ABC': 'ef', 'EFG': 'ef', 'HIJ': 'ef'}
]
'''
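
If what you ultimately want is the list of group names for each keyword, the dictionaries can be reduced to that. The following is a rough sketch rather than part of the approach above: groups_per_match is a hypothetical helper, and the regex here is a two-group stand-in for the generated one. Comparing each capture to the matched text is an extra precaution, on the assumption that one keyword might be a prefix of another in a different group.

```python
import re

# Hypothetical stand-in for the generated regex, limited to two groups.
regex = re.compile(r'(?=(?P<ABC>ab|ef)?)(?=(?P<EFG>cd|ef)?)(?:ab|cd|ef)')


def groups_per_match(regex, text):
    """Return (keyword, [group names]) for every keyword found in text."""
    results = []
    for match in regex.finditer(text):
        keyword = match.group(0)
        # Keep only the groups that captured exactly the matched keyword.
        names = sorted(
            name for name, value in match.groupdict().items()
            if value == keyword
        )
        results.append((keyword, names))
    return results


found = groups_per_match(regex, 'cd ab cd ef')
print(found)
# [('cd', ['EFG']), ('ab', ['ABC']), ('cd', ['EFG']), ('ef', ['ABC', 'EFG'])]
```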