Regex match strings divided by 'and'


I need to parse a string to extract each number and position from it, for example:

2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses

Currently I am using code like this, which returns a list of tuples such as [('2', 'Better Developers'), ('3', 'Testers')]:

import re

def parse_workers_list_from_str(string_value: str) -> list[tuple[str, str]]:
    result: list[tuple[str, str]] = []
    if string_value:
        for part in string_value.split('and'):
            result.append(re.findall(r'(?: *)(\d+|)(?: |)([\w ]+)', part.strip())[0])
    return result

Can I do it without .split(), using only a regex?


Solution

  • With re.MULTILINE, a single regex can do everything, including splitting on 'and' and at the end of each line:

    >>> s = """2 Better Developers and 3 Testers
    5 Mechanics and chef
    medic and 3 nurses"""
    >>> re.findall(r"\s*(\d*)\s*(.+?)(?:\s+and\s+|$)", s, re.MULTILINE)
    [('2', 'Better Developers'), ('3', 'Testers'), ('5', 'Mechanics'), ('', 'chef'), ('', 'medic'), ('3', 'nurses')]
    

    The same pattern with comments (using re.VERBOSE), plus conversion of an empty count '' to 1:

    import re
    
    s = """2 Better Developers and 3 Testers
    5 Mechanics and chef
    medic and 3 nurses"""
    
    results = re.findall(r"""
        # Capture the number if one exists
        (\d*)
        # Skip whitespace between the number and the text
        \s*
        # Capture the text
        (.+?)
        # Attempt to match the word 'and' or the end of the line
        (?:\s+and\s+|$\n?)
        """, s, re.MULTILINE|re.VERBOSE)
    
    results = [(int(n or 1), t.title()) for n, t in results]
    assert results == [(2, 'Better Developers'), (3, 'Testers'), (5, 'Mechanics'), (1, 'Chef'), (1, 'Medic'), (3, 'Nurses')]
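
To tie this back to the question, here is a minimal sketch that wraps the one-line pattern above in a function with the same name and return shape as the original. The constant name WORKER_PATTERN and the "Handymen" input are made up for illustration. Because the delimiter is \s+and\s+ rather than the bare substring 'and', a word that merely contains "and" is not split, which .split('and') would do.

```python
import re

# Illustrative constant (not from the original answer): the same pattern as
# the one-liner above, compiled once so it can be reused.
WORKER_PATTERN = re.compile(r"\s*(\d*)\s*(.+?)(?:\s+and\s+|$)", re.MULTILINE)


def parse_workers_list_from_str(string_value: str) -> list[tuple[str, str]]:
    """Return (count, position) pairs; count is '' when no number is given."""
    if not string_value:
        return []
    return WORKER_PATTERN.findall(string_value)


print(parse_workers_list_from_str("2 Better Developers and 3 Testers"))
# [('2', 'Better Developers'), ('3', 'Testers')]

# "Handymen" contains the letters "and" but is kept whole, because the
# delimiter must be surrounded by whitespace.
print(parse_workers_list_from_str("2 Handymen and chef"))
# [('2', 'Handymen'), ('', 'chef')]
```

This sketch assumes the input has no trailing whitespace on a line; if it might, strip each captured position before using it.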