Tags: python, regex, character, outlook-addin, regex-lookarounds

re.sub() with a non-empty substitution eats up following character in Python


I'm trying to split a word that contains two adjacent vowels by inserting a non-alphabetic group of characters between them. When I use re.sub() with a non-empty substitution, the result contains the insertion, but the insertion seems to have "eaten up" the following character.

Here's an example:

import re

word = "aorta"

re.sub('(?<=[AEOUaeouy])(?:[aeoui])', '[=]', word)
# actual output => 'a[=]rta'
# expected output => 'a[=]or[=]ta'

Why is the character following the insertion eaten up?


Solution

  • You should use a positive lookahead, not a non-capturing group. A lookahead is a non-consuming pattern: it only checks that certain chars follow the current position and does not add them to the match value. A non-capturing group is a consuming pattern: the chars it matches become part of the match value, and re.sub replaces the whole match value, so the second vowel is lost.

    Use

    import re
    word = "aorta"
    print(re.sub('([AEOUaeouy])(?=[aeoui])', r'\1[=]', word))
    # => a[=]orta
    


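    Note: adding r to the first character class alone, [AEOUaeouy] => [AEOUaeouyr], does not yield 'a[=]or[=]ta' for this word, because the r in "aorta" is followed by t, which the lookahead [aeoui] does not accept. To get 'a[=]or[=]ta', the lookahead must also allow t, e.g. ([AEOUaeouyr])(?=[aeouit]).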

    Details

    • ([AEOUaeouy]) - Group 1: any one of the chars listed in the character class
    • (?=[aeoui]) - a positive lookahead: matches a position immediately followed by one of the chars in the character class, without consuming that char
    • \1 - in the replacement pattern, a backreference that inserts the value captured in Group 1.
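
To make the consuming vs. non-consuming difference visible, here is a minimal sketch that prints the match spans for the question's pattern and for a lookaround-only variant. The variant pattern and the names `consuming` and `zero_width` are illustrative assumptions, not part of the original answer. Because the variant matches an empty string between the two vowels, it needs no capturing group or \1 in the replacement.

```python
import re

word = "aorta"

# Question's pattern: the non-capturing group consumes the second vowel,
# so that vowel is part of the match and re.sub replaces it.
consuming = re.compile(r'(?<=[AEOUaeouy])(?:[aeoui])')

# Illustrative variant (an assumption, not from the answer): lookbehind plus
# lookahead, so the match is the empty string between the two vowels.
zero_width = re.compile(r'(?<=[AEOUaeouy])(?=[aeoui])')

consuming_spans = [(m.span(), m.group()) for m in consuming.finditer(word)]
zero_width_spans = [(m.span(), m.group()) for m in zero_width.finditer(word)]

print(consuming_spans)    # [((1, 2), 'o')]
print(zero_width_spans)   # [((1, 1), '')]

print(consuming.sub('[=]', word))   # a[=]rta
print(zero_width.sub('[=]', word))  # a[=]orta
```

The span (1, 2) shows that the 'o' belongs to the match and is therefore overwritten, while the span (1, 1) is an empty match, so the replacement is a pure insertion.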