Tags: python, regex, python-typing

Regex matching "|" separated values for Union types


I'm trying to match type annotations like int | str, and use regex substitution to replace them with a string Union[int, str].

Desired substitutions (before and after):

  • str|int|bool -> Union[str,int,bool]
  • Optional[int|tuple[str|int]] -> Optional[Union[int,tuple[Union[str,int]]]]
  • dict[str | int, list[B | C | Optional[D]]] -> dict[Union[str,int], list[Union[B,C,Optional[D]]]]

The regular expression I've come up with so far is as follows:

r"\w*(?:\[|,|^)[\t ]*((?'type'[a-zA-Z0-9_.\[\]]+)(?:[\t ]*\|[\t ]*(?&type))+)(?:\]|,|$)"

You can try it out here on Regex Demo. It's not really working how I'd want it to. The problems I've noted so far:

  • It doesn't seem to handle nested Union conditions so far. For example, int | tuple[str|int] | bool seems to result in one match, rather than two matches (including the inner Union condition).

  • The regex seems to consume unnecessary ] at the end.

  • Probably the most important one: regex subroutines such as (?&type) don't seem to be supported by the re module in Python. Here is where I got the idea to use them.
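The last point can be checked with the standard library alone. The sketch below tries to compile a stripped-down pattern that uses the same named-group and subroutine syntax as the one above; the helper name compiles_with_re is made up for illustration.

```python
import re


def compiles_with_re(pattern: str) -> bool:
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# PCRE-style named group (?'type'...) plus a subroutine call (?&type).
pcre_style = r"(?'type'\w+)(?:\|(?&type))+"
# The same idea with Python's (?P<type>...) syntax and the sub-pattern written out again.
python_style = r"(?P<type>\w+)(?:\|\w+)+"

print(compiles_with_re(pcre_style))
print(compiles_with_re(python_style))
```

Writing the sub-pattern out twice works for flat cases, but it cannot express arbitrary nesting, which is where recursion support would be needed.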

Additional Info

This is mainly to support the PEP 604 syntax on Python 3.7+, where annotations have to be forward-declared (e.g. declared as strings), as otherwise builtin types don't support the | operator.

Here's some sample code that I came up with:

from __future__ import annotations

import datetime
from decimal import Decimal
from typing import Optional


class A:
    field_1: str|int|bool
    field_2: int  |  tuple[str|int]  |  bool
    field_3: Decimal|datetime.date|str
    field_4: str|Optional[int]
    field_5: Optional[int|str]
    field_6: dict[str | int, list[B | C | Optional[D]]]

class B: ...
class C: ...
class D: ...

For Python versions earlier than 3.10, I use a __future__ import to avoid the error below:

TypeError: unsupported operand type(s) for |: 'type' and 'type'

This essentially converts all annotations to strings, as below:

>>> A.__annotations__
{'field_1': 'str | int | bool', 'field_2': 'int | tuple[str | int] | bool', 'field_3': 'Decimal | datetime.date | str', 'field_4': 'str | Optional[int]', 'field_5': 'Optional[int | str]', 'field_6': 'dict[str | int, list[B | C | Optional[D]]]'}

But in code (say in another module), I want to evaluate the annotations in A. This works in Python 3.10, but fails in Python 3.7 through 3.9, even though the __future__ import supports forward-declared annotations.

>>> from typing import get_type_hints
>>> hints = get_type_hints(A)

Traceback (most recent call last):
    eval(self.__forward_code__, globalns, localns),
  File "<string>", line 1, in <module>
TypeError: unsupported operand type(s) for |: 'type' and 'type'

It seems the best approach to make this work is to replace all occurrences of int | str (for example) with Union[int, str]. Then, with typing.Union included in the additional localns used to evaluate the annotations, it should be possible to evaluate PEP 604-style annotations on Python 3.7+.
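A minimal sketch of that last step, assuming the annotation strings have already been rewritten to the Union[...] form. The function name evaluate_annotation is hypothetical, and it uses plain eval with an explicit namespace instead of get_type_hints to keep the example short.

```python
from typing import Optional, Union


def evaluate_annotation(annotation: str, extra_names: Optional[dict] = None):
    """Evaluate a rewritten annotation string with Union and Optional in scope."""
    namespace = {"Union": Union, "Optional": Optional}
    # Project-specific classes (hypothetical here) can be passed in by name.
    namespace.update(extra_names or {})
    # eval is only reasonable because the strings come from our own source code.
    return eval(annotation, namespace)


hint = evaluate_annotation("Union[str,int,bool]")
nested = evaluate_annotation("Optional[Union[int,str]]")

print(hint == Union[str, int, bool])
print(nested == Optional[Union[int, str]])
```

Builtin names such as str and int resolve without being listed, because eval adds the builtins to the namespace it is given.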


Solution

  • You can install the PyPI regex module (as re does not support recursion) and use

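    import regex

    text = "str|int|bool\nOptional[int|tuple[str|int]]\ndict[str | int, list[B | C | Optional[D]]]"
    # Group 1: the WORD[ prefix; Group 2: pipe-separated items, each optionally followed by a pipe-free [...]
    rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
    n = 1
    res = text
    # Each pass rewrites the unions whose bracketed parts no longer contain pipes, so loop until nothing changes
    while n != 0:
        res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))), res)

    # Finally, wrap the remaining bracket-free runs such as str|int|bool
    print(regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res))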

    Output:

    Union[str,int,bool]
    Optional[Union[int,tuple[Union[str,int]]]]
    dict[Union[str,int], list[Union[B,C,Optional[D]]]]
    

    See the Python demo.

    The first regex finds all kinds of WORD[...] that contain pipe chars and other WORDs or WORD[...] with no pipe chars inside them.

    The \w+(?:\s*\|\s*\w+)+ regex matches 2 or more words that are separated with pipes and optional spaces.

    The first pattern details:

    • (\w+\[) - Group 1 (this will be kept as is at the beginning of the replacement): one or more word chars and then a [ char
    • (\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+) - Group 2 (it will be put inside Union[...] with every \s*\|\s* match replaced with ,):
      • \w+ - one or more word chars
      • (\[(?:[^][|]++|(?3))*])? - an optional Group 3 that matches a [ char, followed by zero or more occurrences of either one or more chars other than [, ] and |, or the whole Group 3 pattern recursed (hence, it matches nested brackets), and then a ] char
      • (?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+ - one or more occurrences (so the match contains at least one pipe char to replace with ,) of:
        • \s*\|\s* - a pipe char enclosed with zero or more whitespaces
        • \w+ - one or more word chars
        • (\[(?:[^][|]++|(?4))*])? - an optional Group 4 (matches the same thing as Group 3, note the (?4) subroutine repeats Group 4 pattern)
    • ] - a ] char.
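
The second, simpler pattern needs no recursion, so it can be tried with the standard re module. A small sketch; the function name flat_pipes_to_union is made up here, and it only covers bracket-free runs such as str|int|bool, which is why the recursive first pass is still needed for nested types.

```python
import re

# Pipe with optional surrounding whitespace.
PIPE = re.compile(r"\s*\|\s*")
# Two or more words separated by pipes.
FLAT_UNION = re.compile(r"\w+(?:\s*\|\s*\w+)+")


def flat_pipes_to_union(text: str) -> str:
    """Wrap each run of pipe-separated words in Union[...]."""
    return FLAT_UNION.sub(
        lambda m: "Union[{}]".format(PIPE.sub(",", m.group())),
        text,
    )


print(flat_pipes_to_union("str|int|bool"))         # Union[str,int,bool]
print(flat_pipes_to_union("Optional[int | str]"))  # Optional[Union[int,str]]
```

Since \w does not match a dot or a bracket, a run stops at the first item that is dotted or subscripted.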