python, regex, non-alphanumeric

How to remove leading and trailing non-alphanumeric characters of a certain string in python using regex?


How do I remove the non-alphanumeric characters that come before and after a certain substring in a given string, while leaving the substring itself untouched? See the example below:

input_string = m#12$my#tr!#$g%

output_string = m12my#tr!g

The substring, in this case, is my#tr!

How can I get the output_string given the input_string?

My attempt below removes all the leading characters (including the alphanumeric ones); see the code snippet below. I also tried using \W+ instead of .+, which did not work.

import re
input_string = "m#12$my#tr!#$g%"
output_string = re.sub(r'.+?(?=my#tr!)', '', input_string)
# 'my#tr!#$g%': every leading character is gone, including 'm12'

I would appreciate any thoughts on how to use a regex pattern for this purpose.


Solution

  • One way to do this is to split the string around the desired substring, replace the non-alphanumeric characters in the first and last parts and then reassemble the string:

    import re
    
    input_string = "m#12$my#tr!#$g%"
    mid = 'my#tr!'
    first, last = input_string.split(mid)
    # A-Z is included so that uppercase letters are kept as well
    first = re.sub(r'[^a-zA-Z0-9]', '', first)
    last = re.sub(r'[^a-zA-Z0-9]', '', last)
    
    output_string = first + mid + last
    print(output_string)
    

    Output:

    m12my#tr!g
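
    The two-name unpacking of split(mid) above raises a ValueError when mid is missing from the string or occurs more than once. As a minimal sketch of a more forgiving variant, str.partition splits on the first occurrence only. The function name clean_around and the choice to return the input unchanged when mid is absent are assumptions made for illustration, not part of the original answer:

```python
import re

# Matches any single character that is not an ASCII letter or digit
NON_ALNUM = re.compile(r'[^a-zA-Z0-9]')

def clean_around(text, mid):
    """Remove non-alphanumeric characters before and after the first occurrence of mid."""
    # partition never raises: if mid is absent, sep is '' and text lands in first
    first, sep, last = text.partition(mid)
    if not sep:
        return text  # assumption: leave the input untouched when mid is missing
    return NON_ALNUM.sub('', first) + sep + NON_ALNUM.sub('', last)

print(clean_around("m#12$my#tr!#$g%", "my#tr!"))  # m12my#tr!g
```

    If mid occurs more than once, only the first occurrence is preserved; later copies are treated as part of the trailing text and cleaned like everything else.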
    

    If you use the regex module from PyPI, you can take advantage of variable-length lookbehinds and remove any non-alphanumeric character that comes before or after the target string:

    import regex
    
    input_string = "m#12$my#tr!#$g%"
    mid = 'my#tr!'
    output_string = regex.sub(rf'[^a-zA-Z0-9](?=.*{mid})|(?<={mid}.*)[^a-zA-Z0-9]', '', input_string)
    # 'm12my#tr!g'
    

    Note that if mid contains characters that are special in regex (e.g. . [ { $ ^), you should escape it before use, i.e.:

    mid = 'my#tr!'
    mid = regex.escape(mid)
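
    The same escaping concern applies to the standard library, where the equivalent call is re.escape. The sketch below shows one way to embed an escaped substring in a pattern using only re, with capture groups and a replacement function instead of a variable-length lookbehind. The names clean_around_re and rebuild are hypothetical, and the sample substring 'a.b$' was picked only because it contains regex metacharacters:

```python
import re

NON_ALNUM = re.compile(r'[^a-zA-Z0-9]')

def clean_around_re(text, mid):
    """Clean the text before and after the first literal occurrence of mid."""
    # re.escape makes every character of mid match literally
    pattern = re.compile(rf'(.*?)({re.escape(mid)})(.*)', re.DOTALL)

    def rebuild(m):
        # group 1 = text before mid, group 2 = mid itself, group 3 = text after mid
        return NON_ALNUM.sub('', m.group(1)) + m.group(2) + NON_ALNUM.sub('', m.group(3))

    # If mid does not occur, sub finds no match and returns text unchanged
    return pattern.sub(rebuild, text, count=1)

print(clean_around_re("#x#a.b$#y#", "a.b$"))  # xa.b$y
print(clean_around_re("#x#aXb#y", "a.b$"))    # #x#aXb#y
```

    In the second call the escaped pattern does not match 'aXb', because the dot is treated as a literal dot rather than as "any character".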
    

    If you don't want to use regex at all, you could manually strip the non-alphanumeric characters out of the first and last parts. For example:

    import string
    
    input_string = "m#12$my#tr!#$g%"
    mid = 'my#tr!'
    first, last = input_string.split(mid)
    first = ''.join(c for c in first if c in string.ascii_letters + string.digits)
    last = ''.join(c for c in last if c in string.ascii_letters + string.digits)
    output_string = first + mid + last
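
    One detail worth checking in the regex-free version is which characters count as alphanumeric. The sketch below compares the ASCII-only membership test used above with str.isalnum, which also accepts non-ASCII letters and digits. The helper names keep_ascii_alnum and keep_unicode_alnum are made up for this illustration, and precomputing the allowed characters as a frozenset is an optional tweak rather than something the original answer does:

```python
import string

# Built once instead of concatenating the two strings for every character
ASCII_ALNUM = frozenset(string.ascii_letters + string.digits)

def keep_ascii_alnum(s):
    """Keep only ASCII letters and digits."""
    return ''.join(c for c in s if c in ASCII_ALNUM)

def keep_unicode_alnum(s):
    """Keep every character that str.isalnum considers alphanumeric."""
    return ''.join(c for c in s if c.isalnum())

print(keep_ascii_alnum("é#1a$"))    # 1a
print(keep_unicode_alnum("é#1a$"))  # é1a
```

    Which of the two is appropriate depends on whether accented or non-Latin letters should survive the cleanup.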