python regex python-re

Capture the same match into different groups


I have a large list of keywords and the groups they belong to. Some keywords belong to more than one group. For each keyword found in the text, I need every named group that contains it.

import re

msg = 'ef'
tmp = re.compile(r'(?P<ABC>ab|ef)|(?P<EFG>cd|ef)')
res = [g.groupdict() for g in tmp.finditer(msg)]

[{'ABC': 'ef', 'EFG': None}]

At the same time, I can find all the groups in the following case:

msg = 'cd ab cd'

[{'ABC': None, 'EFG': 'cd'}, {'ABC': 'ab', 'EFG': None}, {'ABC': None, 'EFG': 'cd'}]

With 'ef', however, only "ABC" is populated, because the first alternative that matches wins. How do I get every matching group ("ABC" and "EFG") when the text contains "ef"?


Solution

  • You can capture the same text into different groups with lookarounds:

    (?=(?P<ABC>ab|ef)?)   # Lookahead and capture 'ab' or 'ef' into group ABC
                          # if either of them is there
    (?=(?P<EFG>cd|ef)?)   # then do the same with 'cd', 'ef' and group EFG,
    (?:ab|cd|ef)          # before matching the real thing: a union set of all keywords.
    

    Try it on regex101.com (see the Match information panel).

    Given that, writing a script to generate such a regex should be trivial:

    from itertools import chain
    import re
    
    
    keywords_by_group = {
      'ABC': ('ab', 'ef'),
      'EFG': ('cd', 'ef'),
      'HIJ': ('ef', 'hi', 'jk')
    }
    
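    # One optional-capture lookahead per group; re.escape guards against
    # keywords that contain regex metacharacters.
    lookaheads = (
      f'''(?=(?P<{group}>{'|'.join(map(re.escape, keywords))})?)'''
      for group, keywords in keywords_by_group.items()
    )
    
    # Sort the union so the generated pattern is deterministic.
    all_keywords = sorted(set(chain(*keywords_by_group.values())))
    
    regex = re.compile(
      fr'''
      {''.join(lookaheads)}
      (?:{'|'.join(map(re.escape, all_keywords))})
      ''',
      re.X
    )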
    
    print(regex.pattern)
    
    '''
    (?=(?P<ABC>ab|ef)?)(?=(?P<EFG>cd|ef)?)(?=(?P<HIJ>ef|hi|jk)?)
    (?:ab|cd|ef|hi|jk)
    '''
    

    Try it:

    message = 'cd ab cd ef'
    print([match.groupdict() for match in regex.finditer(message)])
    
    '''
    [
      {'ABC': None, 'EFG': 'cd', 'HIJ': None},
      {'ABC': 'ab', 'EFG': None, 'HIJ': None},
      {'ABC': None, 'EFG': 'cd', 'HIJ': None},
      {'ABC': 'ef', 'EFG': 'ef', 'HIJ': 'ef'}
    ]
    '''
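
To check the hand-written two-group pattern against the inputs from the question, a minimal sketch could look like the following. The helper name `groups_per_match` is hypothetical and not part of the original answer; it only reduces each `groupdict()` to the names of the groups that actually captured something.

```python
import re

# Hand-written pattern from the answer: each lookahead optionally captures
# the keyword at the current position into its group, then the final
# non-capturing alternation consumes the keyword.
pattern = re.compile(
    r'''
    (?=(?P<ABC>ab|ef)?)
    (?=(?P<EFG>cd|ef)?)
    (?:ab|cd|ef)
    ''',
    re.X,
)


def groups_per_match(text):
    """Return (matched text, sorted group names) for every keyword found.

    Hypothetical helper for illustration only.
    """
    results = []
    for match in pattern.finditer(text):
        # Groups that did not participate in the match are None.
        names = sorted(
            name for name, value in match.groupdict().items()
            if value is not None
        )
        results.append((match.group(0), names))
    return results


print(groups_per_match('ef'))        # [('ef', ['ABC', 'EFG'])]
print(groups_per_match('cd ab cd'))  # [('cd', ['EFG']), ('ab', ['ABC']), ('cd', ['EFG'])]
```

One caveat, as far as I can tell: each lookahead captures independently of what the final alternation consumes, so if one keyword is a prefix of another (say 'ab' and 'abc'), a group may report a different string than the one that was matched. Ordering longer keywords first or adding word boundaries is one possible guard.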