I have a large list of keywords and the groups they belong to. Some keywords belong to more than one group. I need to find every named group that a matched keyword belongs to.
import re

msg = 'ef'
tmp = re.compile(r'(?P<ABC>ab|ef)|(?P<EFG>cd|ef)')
res = [g.groupdict() for g in tmp.finditer(msg)]
print(res)
# [{'ABC': 'ef', 'EFG': None}]
By contrast, all the groups are found when no keyword in the text is shared between groups:
msg = 'cd ab cd'
[{'ABC': None, 'EFG': 'cd'}, {'ABC': 'ab', 'EFG': None}, {'ABC': None, 'EFG': 'cd'}]
But with 'ef', only the first group is reported. How do I find all the groups ("ABC" and "EFG") when "ef" is in the text?
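You can capture the same text into different groups with lookaheads. A lookahead does not consume any input, so each one inspects the same position, and the trailing ? makes each capture optional so the lookahead itself never fails: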
(?=(?P<ABC>ab|ef)?)  # Look ahead and capture 'ab' or 'ef' into group ABC,
                     # if either of them is there,
(?=(?P<EFG>cd|ef)?)  # then do the same with 'cd', 'ef' and group EFG,
(?:ab|cd|ef)         # before matching the real thing: the union of all keywords.
Try it on regex101.com (see the Match information panel).
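To check the same pattern locally, here is a minimal sketch in Python. The name find_groups is a hypothetical helper, not part of the re module, and the pattern is compiled in verbose mode so the comments can stay inline:

```python
import re

# Verbose mode: whitespace and # comments inside the pattern are ignored.
pattern = re.compile(
    r'''
    (?=(?P<ABC>ab|ef)?)   # optionally capture 'ab' or 'ef' into ABC
    (?=(?P<EFG>cd|ef)?)   # optionally capture 'cd' or 'ef' into EFG
    (?:ab|cd|ef)          # consume the keyword itself
    ''',
    re.VERBOSE,
)


def find_groups(text):
    """Return one groupdict per keyword occurrence in text (hypothetical helper)."""
    return [m.groupdict() for m in pattern.finditer(text)]


print(find_groups('ef'))
```

With 'ef' as input, both ABC and EFG should come back populated, because each lookahead captures independently before the final non-capturing group consumes the keyword.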
Given that, generating such a regex from a mapping of groups to keywords takes only a few lines:
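from itertools import chain
import re

keywords_by_group = {
    'ABC': ('ab', 'ef'),
    'EFG': ('cd', 'ef'),
    'HIJ': ('ef', 'hi', 'jk')
}

# One optional capturing lookahead per group; re.escape keeps any
# metacharacters in the keywords literal.
lookaheads = (
    f'''(?=(?P<{group}>{'|'.join(map(re.escape, keywords))})?)'''
    for group, keywords in keywords_by_group.items()
)

# Sorted union of all keywords, so the generated pattern is deterministic.
all_keywords = sorted(set(chain.from_iterable(keywords_by_group.values())))

regex = re.compile(
    fr'''
{''.join(lookaheads)}
(?:{'|'.join(map(re.escape, all_keywords))})
''',
    re.X
)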
print(regex.pattern)
'''
(?=(?P<ABC>ab|ef)?)(?=(?P<EFG>cd|ef)?)(?=(?P<HIJ>ef|hi|jk)?)
(?:ab|cd|ef|hi|jk)
'''
Try it:
message = 'cd ab cd ef'
print([match.groupdict() for match in regex.finditer(message)])
'''
[
{'ABC': None, 'EFG': 'cd', 'HIJ': None},
{'ABC': 'ab', 'EFG': None, 'HIJ': None},
{'ABC': None, 'EFG': 'cd', 'HIJ': None},
{'ABC': 'ef', 'EFG': 'ef', 'HIJ': 'ef'}
]
'''
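If only the group names are needed, the None entries can be filtered out of each groupdict. The following is a minimal sketch: matched_group_names is a hypothetical helper, and demo_regex is the generated pattern from above written out by hand so that the snippet runs on its own.

```python
import re


def matched_group_names(regex, text):
    """Yield (keyword, [group names]) for each match of regex in text."""
    for match in regex.finditer(text):
        # Keep only the groups whose lookahead actually captured something.
        names = [
            name
            for name, value in match.groupdict().items()
            if value is not None
        ]
        yield match.group(0), names


# Same shape as the generated pattern, spelled out for this demo.
demo_regex = re.compile(
    r'(?=(?P<ABC>ab|ef)?)(?=(?P<EFG>cd|ef)?)(?=(?P<HIJ>ef|hi|jk)?)'
    r'(?:ab|cd|ef|hi|jk)'
)

for keyword, names in matched_group_names(demo_regex, 'cd ab cd ef'):
    print(keyword, names)
```

Here 'ef' should be reported with all three groups, while 'cd' and 'ab' each map to a single group.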