
Strip punctuation with regular expression - python


I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.

For instance for an original string:

@#%%.Hol$a.A.$%

I would like to get .Hol$a.A. so that the punctuation is removed from the beginning and end, but not from the middle of the word.

Another example could be for the string:

@#%%...&Hol$a.A....$%

In this case the returned string should be ...&Hol$a.A.... because we do not care if the allowed characters are repeated.

The idea is to remove all punctuation (except the dot) only at the beginning and end of the word. A word is defined as \w and/or a dot (.).

A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the surrounding ' characters.

How can I accomplish this using a regex?


Solution

  • Use this simple and easily adaptable regex:

    [\w.].*[\w.]
    

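    For both example strings it matches exactly the desired result, nothing more. Note that it needs at least two allowed characters to match, so a string with only one (such as @a@) produces no match.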

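    • [\w.] matches one word character (letter, digit or underscore) or a dot: the first character to keep
    • .* greedily matches any run of characters (except newlines, unless re.DOTALL is set)
    • [\w.] matches one word character or a dot again: the last character to keep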

    To change the delimiters, simply change the set of allowed characters inside the [] brackets.
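
    As a rough sketch of changing the allowed set, the pattern can be built from a parameter. The helper name build_edge_pattern and its extra argument are hypothetical, introduced only for illustration; re.escape is used so that characters such as - or ] stay literal inside the brackets.

```python
import re

def build_edge_pattern(extra="."):
    """Hypothetical helper: compile a pattern whose first and last
    matched characters are word characters or one of `extra`."""
    # Escape the extra characters so they are safe inside [...].
    allowed = r"\w" + re.escape(extra)
    return re.compile(rf"[{allowed}].*[{allowed}]")

# Default set: word characters plus the dot (same as the regex above).
dot_only = build_edge_pattern()
print(dot_only.search("@#%%.Hol$a.A.$%").group(0))
# Output: .Hol$a.A.

# Assumed variation: also keep a leading or trailing hyphen.
dot_and_hyphen = build_edge_pattern(".-")
print(dot_and_hyphen.search("@#-Hol$a.A-%").group(0))
# Output: -Hol$a.A-
```

    With the default set, the second sample would come back as Hol$a.A, because the hyphens are then treated like any other punctuation.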

    Check this regex out on regex101.com

    import re
    data = '@#%%.Hol$a.A.$%'
    pattern = r'[\w.].*[\w.]'
    print(re.search(pattern, data).group(0))
    # Output: .Hol$a.A.
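
    The snippet above raises AttributeError when re.search finds nothing, and the pattern cannot match a string with a single allowed character. A possible variation, under the assumption that an empty string is an acceptable result when nothing matches, makes the tail optional and guards against None. The name strip_outer_punctuation is hypothetical.

```python
import re

# The optional group lets a lone allowed character match on its own.
EDGE_PATTERN = re.compile(r"[\w.](?:.*[\w.])?")

def strip_outer_punctuation(text):
    """Hypothetical helper: drop leading and trailing punctuation
    other than the dot; return '' if no allowed character exists."""
    match = EDGE_PATTERN.search(text)
    return match.group(0) if match else ""

for sample in ["@#%%.Hol$a.A.$%", "'Barnes&Nobles'", "@a@", "@#$"]:
    print(repr(strip_outer_punctuation(sample)))
# Output:
# '.Hol$a.A.'
# 'Barnes&Nobles'
# 'a'
# ''
```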