Suppose I have code like this:
import re
rx = re.compile(r'(?:(\w{2}) (\d{2})|(\w{4}) (\d{4}))')
def get_id(s):
    g = rx.match(s).groups()
    return (
        g[0] if g[0] is not None else g[2],
        int(g[1] if g[1] is not None else g[3]),
    )
print(get_id('AA 12')) # ('AA', 12)
print(get_id('BBBB 1234')) # ('BBBB', 1234)
This does what I want, but it requires me to inspect every capture group in order to check which one actually captured the substring. This can become unwieldy if the number of alternatives is high, so I would rather avoid this.
I tried using named captures, but (?:(?P<s>\w{2}) (?P<id>\d{2})|(?P<s>\w{4}) (?P<id>\d{4}))
just raises an error, because re does not allow the same group name to be defined twice.
The trick from the answer to “Unify capture groups for multiple cases in regex” doesn’t work either, as (\w{2}(?= \d{2})|\w{4}(?= \d{4})) (\d{2}|\d{4})
will capture the wrong number of digits: the second group tries \d{2} first no matter which prefix matched. For reasons I’d rather not get into, I cannot hand-optimise the order of alternatives.
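To make that failure concrete, here is a minimal sketch that runs the lookahead-based pattern quoted above. The names rx_lookahead and digits_captured are made up for this illustration.

```python
import re

# The lookahead-based pattern quoted above, unchanged.
rx_lookahead = re.compile(r'(\w{2}(?= \d{2})|\w{4}(?= \d{4})) (\d{2}|\d{4})')

def digits_captured(s):
    """Return the two captured groups, or None if there is no match."""
    m = rx_lookahead.match(s)
    return m.groups() if m else None

print(digits_captured('AA 12'))      # ('AA', '12')
print(digits_captured('BBBB 1234'))  # ('BBBB', '12'): two digits are lost
```

The prefix is chosen correctly by the lookaheads, but nothing ties the digit group to the prefix that matched, so \d{2} wins whenever it is listed first.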
Is there a more idiomatic way to write this?
It seems there is! From the re documentation:
(?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
the (optional) no pattern otherwise.
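As a small illustration of that syntax on its own, here is a sketch with a made-up pattern that is not taken from the question. It accepts a word with or without surrounding double quotes, and the conditional only demands a closing quote when group 1 captured an opening one. The names quoted and unquote are hypothetical.

```python
import re

# Hypothetical pattern for illustration: group 1 optionally captures an
# opening quote; (?(1)") requires a closing quote only if group 1 matched.
quoted = re.compile(r'(")?(\w+)(?(1)")')

def unquote(s):
    """Return the bare word, or None if the quoting is unbalanced."""
    m = quoted.fullmatch(s)
    return m.group(2) if m else None

print(unquote('abc'))    # abc
print(unquote('"abc"'))  # abc
print(unquote('"abc'))   # None
```

When the "no" pattern is omitted and the referenced group did not match, the conditional matches the empty string.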
Applied to the example from the question, this gives:
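rx = re.compile(
    r'(?P<prefix>(?P<prefix_two>)\w{2}(?= \d{2})|(?P<prefix_four>)\w{4}(?= \d{4}))'
    r' (?P<digits>(?(prefix_two)\d{2})(?(prefix_four)\d{4}))'
)
def get_id(s):
    m = rx.match(s)
    if not m:
        return (None, None)
    # The empty marker groups record which alternative matched, so each
    # conditional only contributes its digits when its own marker is set.
    return m.group('prefix'), int(m.group('digits'))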
print(get_id('AA 12')) # ('AA', 12)
print(get_id('BB 1234')) # ('BB', 12)
print(get_id('BBBB 12')) # (None, None)
print(get_id('BBBB 1234')) # ('BBBB', 1234)
Whether it’s worth the trouble, I’ll leave up to the reader.
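For comparison, here is a lower-tech sketch that keeps the question's original pattern. Only one alternative can take part in a match, so dropping the groups that are None leaves just the captured values, however many alternatives there are. The name get_id_filtered is made up for this example, and the approach assumes every alternative has the same number and order of groups, none of them optional.

```python
import re

# The original pattern from the question, with plain numbered groups.
rx = re.compile(r'(?:(\w{2}) (\d{2})|(\w{4}) (\d{4}))')

def get_id_filtered(s):
    m = rx.match(s)
    if not m:
        return (None, None)
    # Only the groups of the alternative that matched are not None.
    prefix, digits = (g for g in m.groups() if g is not None)
    return prefix, int(digits)

print(get_id_filtered('AA 12'))      # ('AA', 12)
print(get_id_filtered('BBBB 1234'))  # ('BBBB', 1234)
```

This trades the conditional syntax for a single generic filtering step, at the cost of losing group names.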