Tags: python, regex, python-re

Is there any way to have re.sub report out on every replacement it makes?


TL;DR: How can I get re.sub to print each substitution it makes, including when the replacement uses groups?

Is it possible to have re.sub print a message every time it makes a replacement, something like a verbose option? This would be very helpful for testing how multiple re.sub calls interact with large texts.

I've managed to come up with this workaround for simple replacements utilizing the fact that the repl argument can be a function:

import re

def replacer(text, verbose=False):
    def repl(matchobj, replacement):
        if verbose:
            print(f"Replacing {matchobj.group()} with {replacement}...")
        return replacement
    text = re.sub(r"[A-Z]+", lambda m: repl(m, "CAPS"), text)
    text = re.sub(r"\d+", lambda m: repl(m, "NUMBER"), text)
    return text

replacer("this is a 123 TEST 456", True)

# Log:
#   Replacing TEST with CAPS...
#   Replacing 123 with NUMBER...
#   Replacing 456 with NUMBER...

However, this doesn't work for groups: it seems re.sub uses the return value of repl literally, without expanding backreferences such as \2\1:

def replacer2(text, verbose=False):
    def repl(matchobj, replacement):
        if verbose:
            print(f"Replacing {matchobj.group()} with {replacement}...")
        return replacement
    text = re.sub(r"([A-Z]+)(\d+)", lambda m: repl(m, r"\2\1"), text)
    return text

replacer2("ABC123", verbose=True) # returns r"\2\1"

# Log:
#   Replacing ABC123 with \2\1...

Of course, a more sophisticated repl function can be written that actually checks for groups in replacement, but at that point that solution seems too complicated for the goal of just getting re.sub to report out on substitutions. Another potential solution would be to just use re.search, report out on that, then use re.sub to make the replacement, potentially using the Pattern.sub variant in order to specify pos and endpos to save the sub function from having to search the whole string again. Surely there's a better way than either of these options?


Solution

  • Use matchobj.expand(replacement), which processes the replacement template and substitutes the group references:

    import re
    
    def replacer2(text, verbose=False):
        def repl(matchobj, replacement):
            result = matchobj.expand(replacement)
            if verbose:
                print(f"Replacing {matchobj.group()} with {result}...")
            return result
        text = re.sub(r"([A-Z]+)(\d+)", lambda m: repl(m, r"\2\1"), text)
        return text
    
    print(replacer2("ABC123", verbose=True))
    

    Output:

    Replacing ABC123 with 123ABC...
    123ABC
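
As a minimal sketch of what Match.expand does on its own, the snippet below applies a few templates to a single match. The sample string and the group names word and num are made up for illustration; expand accepts the same template syntax as a string repl passed to re.sub.

```python
import re

# Hypothetical sample input; the group names are illustrative only.
m = re.search(r"(?P<word>[A-Z]+)(?P<num>\d+)", "ABC123")

# Numbered backreferences are substituted from the match.
swapped = m.expand(r"\2\1")
# Named groups use the \g<name> form.
named = m.expand(r"\g<num>-\g<word>")
# \g<0> stands for the entire match.
wrapped = m.expand(r"[\g<0>]")

print(swapped, named, wrapped)
```

Because expand works on the match object alone, it can be called inside any repl function without running the pattern a second time.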
    

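    A generic example that extends re.sub with a verbose option and also lets a replacement function return a template containing group references. Note that, unlike plain re.sub, the function's return value is passed through expand here, so a literal backslash in it must be escaped: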

    import re
    
    def sub2(pattern, repl, string, count=0, flags=0, verbose=False):
        def helper(match, repl):
            result = match.expand(repl(match) if callable(repl) else repl)
            if verbose:
                print(f'offset {match.start()}: {match.group()!r} -> {result!r}')
            return result
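        # pass count and flags by keyword; passing them positionally is deprecated in newer Python versions
        return re.sub(pattern, lambda m: helper(m, repl), string, count=count, flags=flags)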
    
    # replace three digits with their reverse
    print(sub2(r'(\d)(\d)(\d)', r'\3\2\1', 'abc123def45ghi789', verbose=True))
    # reverse runs of three digits; wrap runs of two digits in parentheses
    print(sub2(r'(\d)(\d)(\d)?',
               lambda m: r'(\1\2)' if m.group(3) is None else r'\3\2\1',
               'abc123def45ghi789', verbose=True))
    

    Output:

    offset 3: '123' -> '321'
    offset 14: '789' -> '987'
    abc321def45ghi987
    offset 3: '123' -> '321'
    offset 9: '45' -> '(45)'
    offset 14: '789' -> '987'
    abc321def(45)ghi987
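
If printing during the substitution is inconvenient, for example in tests, the same idea can collect the replacements and return them alongside the result. This is only a sketch: sub_with_log is a hypothetical helper, not part of the re module. It also makes a different choice from sub2 above, assumed here for illustration: a callable's return value is used literally, as re.sub itself does, and only string templates go through expand.

```python
import re

def sub_with_log(pattern, repl, string, count=0, flags=0):
    """Like re.sub, but also return a list of (offset, old, new) tuples.

    Hypothetical helper for illustration; not part of the re module.
    """
    log = []

    def helper(match):
        # String templates are expanded; a callable's result is used
        # literally, matching the behaviour of re.sub itself.
        new = repl(match) if callable(repl) else match.expand(repl)
        log.append((match.start(), match.group(), new))
        return new

    result = re.sub(pattern, helper, string, count=count, flags=flags)
    return result, log

text, changes = sub_with_log(r"(\d)(\d)(\d)", r"\3\2\1", "abc123def45ghi789")
print(text)
for offset, old, new in changes:
    print(f"offset {offset}: {old!r} -> {new!r}")
```

Returning the log as data lets the caller decide whether to print it, assert on it, or send it to a logger.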