python, regex, split

Separate a string between each two neighbouring different digits via `re.split` DIRECTLY (in Python)?


For instance, I'd like to convert "91234 5g5567\t7₇89^" into ["9","1","2","3","4 5g55","67\t7₇8","9^"]. Of course this can be done in a for loop without any regular expressions, but I want to know whether it can be done with a single regular expression. So far I have found two ways:

>>> import re
>>> def way0(char: str):
...     delimiter = ""
...     while True:
...         delimiter += " "
...         if delimiter not in char:
...             substitution = re.compile("([0-9])(?!\\1)([0-9])")
...             replacement = "\\1"+delimiter+"\\2"
...             cin = [char]
...             while True:
...                 cout = []
...                 for term in cin: cout.extend(substitution.sub(replacement,term).split(delimiter))
...                 if cout == cin:
...                     return cin
...                 else:
...                     cin = cout
...
>>> way0("91234 5g5567\t7₇89^")
['9', '1', '2', '3', '4 5g55', '67\t7₇8', '9^']
>>> import itertools
>>> way1 = lambda w: ["".join(list(y)) for x, y in itertools.groupby(re.split("(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)", w), lambda z: z != "") if x]
>>> way1("91234 5g5567\t7₇89^")
['9', '1', '2', '3', '4 5g55', '67\t7₇8', '9^']

However, neither way0 nor way1 is concise or ideal. I have read the documentation for re.split; unfortunately, the following attempt does not return the desired output:

>>> re.split(r"(\d)(?!\1)(\d)","91234 5g5567\t7₇89^")
['', '9', '1', '', '2', '3', '4 5g5', '5', '6', '7\t7₇', '8', '9', '^']

Can re.split solve this problem directly (that is, without extra conversions)? (Efficiency is not my concern here.)

There are earlier questions on this topic (for example, Regular expression of two digit number where two digits are not same, Regex to match 2 digit but different numbers, and Regular expression to match sets of numbers that are not equal nor reversed), but they are about "RegMatch". My question is about "RegSplit" (rather than "RegMatch" or "RegReplace").


Solution

  • If you want to solve this with re.split in one step, without capture groups or any further processing, one idea is to use only lookarounds: match the empty position between two digits, and inside the lookbehind use a negative lookahead to reject positions where the two digits are the same.

    (?=[0-9])(?<=(?!00|11|22|33|44|55|66|77|88|99)[0-9])
    

    See this demo at regex101 or the Python demo at tio.run

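    How it works: the lookahead (?=[0-9]) requires a digit to the right of the current position, and the lookbehind (?<=...[0-9]) requires a digit to the left, so together they match any position between two digits. The negative lookahead inside the lookbehind is evaluated one character back, so it sees the left digit followed by the right digit and rejects the position if the two are the same.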

    I used [0-9] rather than \d because I was unsure whether \d matches Unicode digits in your Python version (in Python 3 it does for str patterns unless re.ASCII is set).
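
The pattern above can be tried out with a short script. This is only a sketch: the names SAME_PAIR, BOUNDARY and split_between_different_digits, and the sample string, are made up for illustration and are not from the question. It assumes Python 3.7 or later, since earlier versions of re.split do not split on empty matches.

```python
import re

# Build the alternation 00|11|...|99 instead of typing it out.
SAME_PAIR = "|".join(d * 2 for d in "0123456789")

# Zero-width match: a digit on the right, a digit on the left,
# and the two digits must not be the same.
BOUNDARY = re.compile(rf"(?=[0-9])(?<=(?!{SAME_PAIR})[0-9])")


def split_between_different_digits(text):
    """Split text at every position between two different ASCII digits."""
    return BOUNDARY.split(text)


# Hypothetical sample input (not the string from the question).
sample = "1123 4a556"
print(split_between_different_digits(sample))  # ['11', '2', '3 4a55', '6']

# Why [0-9] instead of \d: for str patterns, \d also matches
# non-ASCII decimal digits such as FULLWIDTH DIGIT SEVEN (U+FF17).
fullwidth_seven = "\uff17"
print(bool(re.fullmatch(r"\d", fullwidth_seven)))      # True
print(bool(re.fullmatch(r"[0-9]", fullwidth_seven)))   # False
```

Because the pattern contains no capture groups, re.split returns only the pieces and no extra elements, which is what the capturing attempt in the question could not avoid.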