I'm trying to match type annotations like int | str, and use regex substitution to replace them with the string Union[int, str].
Desired substitutions (before and after):
str|int|bool
-> Union[str,int,bool]
Optional[int|tuple[str|int]]
-> Optional[Union[int,tuple[Union[str,int]]]]
dict[str | int, list[B | C | Optional[D]]]
-> dict[Union[str,int], list[Union[B,C,Optional[D]]]]
The regular expression I've come up with so far is as follows:
r"\w*(?:\[|,|^)[\t ]*((?'type'[a-zA-Z0-9_.\[\]]+)(?:[\t ]*\|[\t ]*(?&type))+)(?:\]|,|$)"
You can try it out here on Regex Demo. It's not really working how I'd want it to. The problems I've noted so far:
It doesn't seem to handle nested Union conditions so far. For example, int | tuple[str|int] | bool seems to result in one match, rather than two matches (including the inner Union condition).
The regex seems to consume an unnecessary ] at the end.
Probably the most important one: I noticed that regex subroutines don't seem to be supported by the re module in Python. Here is where I got the idea to use that from.
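The last point can be checked with a few lines. This is only a minimal sketch: the pattern below is a simplified, hypothetical stand-in for the one above (it uses the (?P<type>...) spelling that re accepts for named groups), kept just to show how re reacts to the (?&type) subroutine call.

```python
import re

# Simplified stand-in pattern: a named group followed by a subroutine
# call to it, which is PCRE/regex-module syntax, not stdlib re syntax.
pattern = r"(?P<type>\w+)(?:\|(?&type))+"

try:
    re.compile(pattern)
    supported = True
except re.error as exc:
    # The stdlib engine rejects the (?&...) construct at compile time.
    supported = False
    print("re rejected the pattern:", exc)

print(supported)
```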
This is mainly to support the PEP 604 syntax on Python 3.7+, which requires annotations to be forward-declared (e.g. declared as strings), as otherwise builtin types don't support the | operator.
Here's a sample code that I came up with:
from __future__ import annotations
import datetime
from decimal import Decimal
from typing import Optional

class A:
    field_1: str|int|bool
    field_2: int | tuple[str|int] | bool
    field_3: Decimal|datetime.date|str
    field_4: str|Optional[int]
    field_5: Optional[int|str]
    field_6: dict[str | int, list[B | C | Optional[D]]]

class B: ...
class C: ...
class D: ...
For Python versions earlier than 3.10, I use a __future__ import to avoid the error below:
TypeError: unsupported operand type(s) for |: 'type' and 'type'
This essentially converts all annotations to strings, as below:
>>> A.__annotations__
{'field_1': 'str | int | bool', 'field_2': 'int | tuple[str | int] | bool', 'field_3': 'Decimal | datetime.date | str', 'field_4': 'str | Optional[int]', 'field_5': 'Optional[int | str]', 'field_6': 'dict[str | int, list[B | C | Optional[D]]]'}
But in code (say in another module), I want to evaluate the annotations in A. This works in Python 3.10, but fails on Python 3.7 through 3.9, even though the __future__ import supports forward-declared annotations.
>>> from typing import get_type_hints
>>> hints = get_type_hints(A)
Traceback (most recent call last):
eval(self.__forward_code__, globalns, localns),
File "<string>", line 1, in <module>
TypeError: unsupported operand type(s) for |: 'type' and 'type'
It seems the best approach to make this work is to replace all occurrences of int | str (for example) with Union[int, str], and then, with typing.Union included in the additional localns used to evaluate the annotations, it should be possible to evaluate PEP 604-style annotations for Python 3.7+.
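A minimal sketch of that last step, assuming the conversion has already happened: the class name Converted and its hand-written annotation strings are hypothetical stand-ins for the output of the substitution. The example deliberately avoids builtin generics such as tuple[...], since those only became subscriptable in Python 3.9.

```python
from typing import Optional, Union, get_type_hints


class Converted:
    # Hand-written stand-ins for annotations after the substitution,
    # i.e. "str | Optional[int]" and "Optional[int | str]" rewritten.
    field_4: "Union[str,Optional[int]]"
    field_5: "Optional[Union[int,str]]"


# Supply the typing names explicitly, so the strings can be evaluated
# even if the defining module never imported Union itself.
extra_names = {"Union": Union, "Optional": Optional}
hints = get_type_hints(Converted, localns=extra_names)

for name, hint in hints.items():
    print(name, hint)
```

Both strings evaluate without the TypeError because no | operator is left in them.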
You can install the PyPI regex module (as re does not support recursion) and use:
import regex

text = "str|int|bool\nOptional[int|tuple[str|int]]\ndict[str | int, list[B | C | Optional[D]]]"
rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
n = 1
res = text
# Repeat until no WORD[... | ...] construct is left to rewrite
while n != 0:
    res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))), res)
# Finally, wrap the remaining pipe-separated words in Union[...]
print(regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res))
Output:
Union[str,int,bool]
Optional[Union[int,tuple[Union[str,int]]]]
dict[Union[str,int], list[Union[B,C,Optional[D]]]]
See the Python demo.
The first regex finds all kinds of WORD[...] that contain pipe chars and other WORDs or WORD[...] with no pipe chars inside them.
The \w+(?:\s*\|\s*\w+)+ regex matches 2 or more words that are separated with pipes and optional spaces.
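This second pattern uses no recursion, so it can be tried with the standard re module alone. The sketch below is only an illustration of that one step; the names FLAT_UNION and flat_to_union are hypothetical and not part of the code above.

```python
import re

# Two or more words separated by pipes and optional whitespace
FLAT_UNION = re.compile(r"\w+(?:\s*\|\s*\w+)+")


def flat_to_union(text):
    """Wrap each flat run of pipe-separated words in Union[...]."""
    def repl(match):
        # Replace every pipe (plus surrounding spaces) with a comma
        members = re.sub(r"\s*\|\s*", ",", match.group())
        return "Union[{}]".format(members)
    return FLAT_UNION.sub(repl, text)


print(flat_to_union("str | int|bool"))  # Union[str,int,bool]
```

Note that \w does not match a dot, so a dotted name such as datetime.date is not treated as a single word by this pattern.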
The first pattern details:
(\w+\[) - Group 1 (this will be kept as is at the beginning of the replacement): one or more word chars and then a [ char
(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+) - Group 2 (it will be put inside Union[...] with all \s*\|\s* matches replaced with ,):
    \w+ - one or more word chars
    (\[(?:[^][|]++|(?3))*])? - an optional Group 3 that matches a [ char, followed with zero or more occurrences of either one or more chars other than [, ] and |, or the whole Group 3 recursed (hence, it matches nested square brackets), and then a ] char
    (?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+ - one or more occurrences (so the match contains at least one pipe char to replace with ,) of:
        \s*\|\s* - a pipe char enclosed with zero or more whitespaces
        \w+ - one or more word chars
        (\[(?:[^][|]++|(?4))*])? - an optional Group 4 (matches the same thing as Group 3, note the (?4) subroutine repeats the Group 4 pattern)
] - a ] char.
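If installing the regex module is not an option, the same rewriting can be sketched without any regex at all. This is a different technique from the one above: a small bracket-depth splitter plus a recursive converter, using only the standard library. The function names split_top_level, convert and convert_member are hypothetical, and the sketch assumes well-formed annotations with balanced brackets and no | or , inside string literals (e.g. Literal["a|b"] is not handled).

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested inside [...]."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts


def convert_member(member):
    """Convert one union member such as 'int' or 'tuple[str|int]'."""
    head, bracket, rest = member.partition("[")
    if not bracket:
        return member  # plain (possibly dotted) name, nothing to rewrite
    inner = rest[:-1]  # drop the closing ] that matches the first [
    args = [convert(arg.strip()) for arg in split_top_level(inner, ",")]
    return "{}[{}]".format(head, ", ".join(args))


def convert(annotation):
    """Rewrite PEP 604 unions in an annotation string as Union[...]."""
    members = [convert_member(m.strip())
               for m in split_top_level(annotation, "|")]
    if len(members) == 1:
        return members[0]
    return "Union[{}]".format(",".join(members))


for sample in ("str|int|bool",
               "Optional[int|tuple[str|int]]",
               "dict[str | int, list[B | C | Optional[D]]]"):
    print(convert(sample))
```

Because members are split on top-level pipes rather than matched with \w+, dotted names such as datetime.date pass through unchanged.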