python python-3.x regex regex-group python-re

Placing only digits in capture groups when converting a string to an array of ints

Background

This question was inspired by the following question on Code Review: Converting a string to an array of integers, which opens as follows:

I am dealing with a string draw_result that can be in one of the following formats:
"03-23-27-34-37, Mega Ball: 13" 
"01-12 + 08-20" 
"04-15-17-25-41"
I always start with draw_result where the value is one from the above values. I want to get to:
[3, 23, 27, 34, 37] 
[1, 12, 8, 20]
[4, 15, 17, 25, 41]

This can be solved with multiple regular expressions, as follows:

import re
from typing import Iterable

lottery_searches = [
    re.compile(pat).match
    for pat in (
        r'^(\d\d)-(\d\d)-(\d\d)-(\d\d)-(\d\d), Mega Ball.*$',
        r'^(\d\d)-(\d\d) \+ (\d\d)-(\d\d)$',
        r'^(\d\d)-(\d\d)-(\d\d)-(\d\d)-(\d+)$',
    )
]


def lottery_string_to_ints(lottery: str) -> Iterable[int]:
    for search in lottery_searches:
        if match := search(lottery):
            return (int(g) for g in match.groups())

    raise ValueError(f'"{lottery}" is not a valid lottery string')

Question

Assume that we allow a separator other than -, but all separators within one string must be equal. For instance,
```
"04/15/17/25/41"
"04,15,17,25,41"
"01,12 + 08,20"
"01?12 + 08?20"
```
are all valid formats.
Is it now possible to only have the digits in capture groups? Is it possible to mark all digit capture groups in some way for easy retrieval?

Attempt at solution

Regex

PATTERN = re.compile(
    r"""
         (?P<digit0>\d\d)                   # Matches a double digit [00..99] and names it digit0
         (?P<sep>-)                         # Matches the separator (a literal - here) and saves it as sep
         (?P<digit1>\d\d)                   # Matches a double digit [00..99] and names it digit1
         (\s+\+\s+|(?P=sep))                # Matches whitespace, +, whitespace OR the separator saved in sep (-)
         (?P<digit2>\d\d)                   # Matches a double digit [00..99] and names it digit2
         (?P=sep)                           # Matches the separator saved in sep again (backreference)
         (?P<digit3>\d\d)                   # Matches a double digit [00..99] and names it digit3
         ((?P=sep)(?P<digit4>\d\d))?        # Optionally matches a fifth number (e.g. -01) and names it digit4
        """,
    re.VERBOSE,
)

Retrieval

def extract_numbers_narrow(draw_result, digits=5):
    numbers = []
    if match := PATTERN.match(draw_result):
        for i in range(digits):
            ith_digit = f"digit{i}"
            try:
                number = int(match.group(ith_digit))
            except IndexError:  # Catches if the group does not exist
                continue
            except TypeError:  # Catches if the group is None
                continue
            numbers.append(number)
    return numbers

Problem: I had to include a dirty try statement, as the fifth number might or might not appear in the result.
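One possible way around the try statement, as a minimal sketch: every digitN group is defined in the pattern, so match.groupdict() returns all of them, with None for an optional group that did not take part in the match. The names SEP_PATTERN and extract_numbers_by_name are hypothetical, and the separator class [^\w\s] (any single character that is neither a word character nor whitespace) is an assumption, not something the question fixes.

```python
import re

# Assumption: a separator is one character that is neither a word
# character nor whitespace. The backreference (?P=sep) then forces
# every later separator to be the same character.
SEP_PATTERN = re.compile(
    r"""
    (?P<digit0>\d\d)
    (?P<sep>[^\w\s])
    (?P<digit1>\d\d)
    (?:\s+\+\s+|(?P=sep))
    (?P<digit2>\d\d)
    (?P=sep)
    (?P<digit3>\d\d)
    (?:(?P=sep)(?P<digit4>\d\d))?
    """,
    re.VERBOSE,
)


def extract_numbers_by_name(draw_result):
    """Return the numbers held in the digitN groups, or [] if there is no match."""
    match = SEP_PATTERN.match(draw_result)
    if match is None:
        return []
    # groupdict() maps every named group to its text, or to None when an
    # optional group (digit4 here) did not participate in the match.
    return [
        int(value)
        for name, value in match.groupdict().items()
        if name.startswith("digit") and value is not None
    ]


for text in ("04/15/17/25/41", "01?12 + 08?20", "04-15/17-25"):
    print(text, "=>", extract_numbers_by_name(text))
```

Filtering on the digit prefix keeps sep out of the result, and the mixed-separator string yields an empty list because the backreference fails.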

Solution

It seems you want to get all numbers before a comma. You can use this PyPI regex-based solution:

import regex

texts = ['03-23-27-34-37, Mega Ball: 13', '01-12 + 08-20', '04-15-17-25-41']
reg = regex.compile(r'^(?:[^\w,]*(\d+))+')

for text in texts:
    match = reg.search(text)
    if match:
        print( text, '=>', list(map(int,match.captures(1))) )

See the online Python demo.

The ^(?:[^\w,]*(\d+))+ regex matches, at the start of the string, one or more sequences of zero or more chars other than word and comma chars followed by one or more digits (captured into Group 1). Since the regex module keeps a stack of captures for each capturing group, you can access all captured numbers with .captures().
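For contrast, here is a small sketch of what the built-in re module keeps when the same pattern is run through it: a repeated capturing group retains only its last iteration, so the earlier numbers cannot be read from the group itself. The variable names are illustrative only.

```python
import re

text = "04-15-17-25-41"

# Same pattern as above, but compiled by the built-in re module.
match = re.match(r"^(?:[^\w,]*(\d+))+", text)

# The overall match still spans every number...
whole_match = match.group()

# ...but Group 1 holds only the text from its final repetition.
last_capture = match.group(1)

print(whole_match)   # 04-15-17-25-41
print(last_capture)  # 41
```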

If you need to do it with built-in re, you can use

import re
 
texts = ['03-23-27-34-37, Mega Ball: 13', '01-12 + 08-20', '04-15-17-25-41']
reg = re.compile(r'^(?:[^\w,]*\d+)+')
 
for text in texts:
    match = reg.search(text)
    if match:
        print( text, '=>', list(map(int,re.findall(r'\d+', match.group()))) )

See this Python demo where re.findall(r'\d+'...) extracts the numbers from the match value.

Both output:

03-23-27-34-37, Mega Ball: 13 => [3, 23, 27, 34, 37]
01-12 + 08-20 => [1, 12, 8, 20]
04-15-17-25-41 => [4, 15, 17, 25, 41]