python, regex, nlp, data-cleaning, python-re

How to replace a dot (.) in a sentence, except when it appears in an abbreviation, using a regular expression


I want to replace every dot in a sentence with a space, except when it is part of an abbreviation. In an abbreviation, I want to replace it with an empty string ('').

By abbreviation I mean a dot surrounded by at least two capital letters.

My regexes work, except that they catch only U.S. (not the whole of U.S.A):

r1 = r'\b((?:[A-Z]\.){2,})\s*'
r2 = r'(?:[A-Z]\.){2,}'

'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'

should become

'USA is abbr  x y  is not But IIT is also valid ABBVR and so is MTech'

UPDATE: Numbers and special characters should not be treated as part of an abbreviation:

X.2 -> X 2
X. -> X 
X.* -> X - 

Solution

  • You can use

    import re
    s='U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
    print(re.sub(r'(?<=[A-Z])(\.)(?=[A-Z])|\.', lambda x: '' if x.group(1) else ' ', s))
    # =>  USA is abbr  x y  is not  But IIT  is also valid ABBVR and so is MTech, X 2, X , X *
    

    The pattern matches:

    • (?<=[A-Z])(\.)(?=[A-Z]) - Group 1: a . char that is immediately preceded and followed by an uppercase ASCII letter
    • | - or
    • \. - a dot (in any other context)

    If Group 1 matches, the replacement is an empty string; otherwise, it is a space.
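Note that the output above keeps a doubled space wherever a dot was followed by a space (for example after "not" and after "IIT"), while the expected string in the question has a single space there. If that exact string is wanted, one possible variant also deletes a dot that directly precedes whitespace. This is only a sketch and not part of the original answer; the names PATTERN and replace_dots are made up for illustration.

```python
import re

# Group 1: dots to delete, i.e. a dot between two uppercase ASCII letters,
# or a dot directly before whitespace (assumption: this avoids doubled spaces).
# Any other dot falls through to the last alternative and becomes a space.
PATTERN = re.compile(r'((?<=[A-Z])\.(?=[A-Z])|\.(?=\s))|\.')

def replace_dots(text):
    # Group 1 is None when only the fallback alternative matched.
    return PATTERN.sub(lambda m: '' if m.group(1) else ' ', text)

s = 'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'
print(replace_dots(s))
# => USA is abbr  x y  is not But IIT is also valid ABBVR and so is MTech
```

A dot at the very end of the string, or before a digit or symbol, is still replaced with a space, so X.2 becomes X 2 and a trailing X. becomes X followed by a space.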

    To make it Unicode-aware, install the PyPI regex library (pip install regex) and use

    import regex
    s='U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
    print(regex.sub(r'(?<=\p{Lu})(\.)(?=\p{Lu})|\.', lambda x: '' if x.group(1) else ' ', s))
    

    \p{Lu} matches any Unicode uppercase letter.
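If installing a third-party package is not an option, a rough standard-library approximation is to match every dot and let the replacement callback inspect the neighbouring characters with str.isupper(). This is a hedged sketch, not an exact equivalent: str.isupper() is Unicode-aware, but it does not correspond one-to-one to \p{Lu}. The function name replace_dots_unicode is hypothetical.

```python
import re

def replace_dots_unicode(text):
    """Delete a dot between two uppercase characters; turn any other dot into a space."""
    def repl(m):
        # Look at the characters on either side of the dot, if they exist.
        before = text[m.start() - 1] if m.start() > 0 else ''
        after = text[m.end()] if m.end() < len(text) else ''
        # ''.isupper() is False, so dots at the string edges become spaces.
        return '' if before.isupper() and after.isupper() else ' '
    return re.sub(r'\.', repl, text)

print(replace_dots_unicode('U.S.A is abbr  x.y  is not. X.2, X.*'))
# => USA is abbr  x y  is not  X 2, X *
```

With the ASCII-only [A-Z] pattern, a dot after an accented capital such as É would be treated as an ordinary dot; this version treats it like any other uppercase letter.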