python regex nlp data-cleaning python-re

How to replace a dot (.) in a sentence, except when it appears in an abbreviation, using a regular expression

I want to replace every dot in a sentence with a space, except when it is part of an abbreviation. In an abbreviation, I want to replace the dot with an empty string ('').

Here, an abbreviation means a dot surrounded by at least two capital letters.

My regexes work, except that they catch U.S.

r1 = r'\b((?:[A-Z]\.){2,})\s*'
r2 = r'(?:[A-Z]\.){2,}'

'U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech'

should become

'USA is abbr  x y  is not But IIT is also valid ABBVR and so is MTech'

UPDATE: Numbers and special characters should not be treated as part of an abbreviation:

X.2 -> X 2
X. -> X 
X.* -> X *

Solution

You can use

import re
s='U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(re.sub(r'(?<=[A-Z])(\.)(?=[A-Z])|\.', lambda x: '' if x.group(1) else ' ', s))
# =>  USA is abbr  x y  is not  But IIT  is also valid ABBVR and so is MTech, X 2, X , X *

See the Python demo and the regex demo. The pattern matches:

(?<=[A-Z])(\.)(?=[A-Z]) - Group 1: a . char that is immediately preceded and followed by an uppercase ASCII letter
| - or
\. - a dot (in any other context)

If Group 1 matched, the replacement is an empty string; otherwise, the replacement is a space.
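For reuse, the same substitution can be wrapped in a small helper. This is only a sketch of the approach described here: the names ABBR_DOT and replace_dots are made up for illustration, and the pattern is the same ASCII-only one, precompiled.

```python
import re

# Same pattern as in the answer, precompiled. Group 1 is set only when
# the dot sits between two uppercase ASCII letters.
ABBR_DOT = re.compile(r'(?<=[A-Z])(\.)(?=[A-Z])|\.')

def replace_dots(text):
    """Drop dots inside abbreviations; turn every other dot into a space."""
    def repl(match):
        # match.group(1) is '.' for the first alternative, None for the second.
        return '' if match.group(1) else ' '
    return ABBR_DOT.sub(repl, text)

print(replace_dots('U.S.A'))   # USA
print(replace_dots('M.Tech'))  # MTech
print(replace_dots('x.y'))     # x y
print(replace_dots('X.2'))     # X 2
```

Note that a trailing dot, as in I.I.T., is not followed by an uppercase letter, so it becomes a space rather than being removed.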

To make it Unicode-aware, install the PyPI regex library (pip install regex) and use

import regex
s='U.S.A is abbr  x.y  is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(regex.sub(r'(?<=\p{Lu})(\.)(?=\p{Lu})|\.', lambda x: '' if x.group(1) else ' ', s))

Here, \p{Lu} matches any Unicode uppercase letter.
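If installing a third-party package is not an option, a similar result can be sketched with the standard library alone. This is a different technique from the lookaround pattern: it matches every dot and decides in the callback, using unicodedata.category to test for the Lu (uppercase letter) category. The helper names is_lu and replace_dots_unicode are hypothetical.

```python
import re
import unicodedata

def is_lu(ch):
    """True if ch is in the Unicode general category Lu (uppercase letter)."""
    return unicodedata.category(ch) == 'Lu'

def replace_dots_unicode(text):
    """Drop a dot between two Lu characters; replace any other dot with a space."""
    def repl(match):
        i = match.start()
        # Look at the neighbours in the original string, as a lookbehind
        # and lookahead would.
        before = text[i - 1] if i > 0 else ''
        after = text[i + 1] if i + 1 < len(text) else ''
        if before and after and is_lu(before) and is_lu(after):
            return ''
        return ' '
    return re.sub(r'\.', repl, text)

print(replace_dots_unicode('U.S.A is abbr, x.y is not'))
```

The trade-off is that the decision logic lives in Python rather than in the pattern, which is easier to extend but slower on very large inputs.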