I want to replace every dot in a sentence with a space, except when it is part of an abbreviation. When it is part of an abbreviation, I want to replace it with '' (NULL, i.e. an empty string).
An abbreviation means a dot surrounded by at least two capital letters.
My regexes work, except that they catch U.S.
r1 = r'\b((?:[A-Z]\.){2,})\s*'
r2 = r'(?:[A-Z]\.){2,}'
'U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech'
should become
'USA is abbr x y is not But IIT is also valid ABBVR and so is MTech'
UPDATE: It should not be considering any numbers or special characters.
X.2 -> X 2
X. -> X
X.* -> X *
You can use
import re
s='U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(re.sub(r'(?<=[A-Z])(\.)(?=[A-Z])|\.', lambda x: '' if x.group(1) else ' ', s))
# => USA is abbr x y is not  But IIT  is also valid ABBVR and so is MTech, X 2, X , X *
# (note the doubled spaces where a dot was already followed by a space)
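Because every non-abbreviation dot becomes a space, a dot that was already followed by a space (as in "not. But" or "I.I.T. is") leaves two spaces behind. If the single-spaced result from the question is what you need, one possible sketch is to collapse whitespace afterwards. The helper name replace_dots is only an illustration, and note that split() also trims leading/trailing whitespace and collapses tabs and newlines, which may or may not be acceptable for your data.

```python
import re

# Same pattern as above: group 1 captures a dot between two uppercase ASCII letters.
DOT_RE = re.compile(r'(?<=[A-Z])(\.)(?=[A-Z])|\.')

def replace_dots(text):
    """Hypothetical helper: drop abbreviation dots, turn other dots into spaces,
    then collapse runs of whitespace into single spaces."""
    replaced = DOT_RE.sub(lambda m: '' if m.group(1) else ' ', text)
    return ' '.join(replaced.split())

s = 'U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech'
print(replace_dots(s))
# => USA is abbr x y is not But IIT is also valid ABBVR and so is MTech
```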
See the Python demo. Here is a regex demo. The pattern matches:
- (?<=[A-Z])(\.)(?=[A-Z]) - Group 1: a . char that is immediately preceded and followed by an uppercase ASCII letter
- | - or
- \. - a dot (in any other context)

If Group 1 matches, the replacement is an empty string; otherwise, the replacement is a space.
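To see which alternative fires for each dot, here is a small sketch that lists every match and whether Group 1 participated. The function name classify_dots and the labels 'removed' / 'space' are made up for illustration.

```python
import re

pattern = re.compile(r'(?<=[A-Z])(\.)(?=[A-Z])|\.')

def classify_dots(text):
    """Hypothetical helper: return (index, action) for every dot in text.
    'removed' means Group 1 matched (dot between uppercase letters),
    'space' means the plain \\. alternative matched."""
    return [(m.start(), 'removed' if m.group(1) else 'space')
            for m in pattern.finditer(text)]

print(classify_dots('U.S.A x.y I.I.T.'))
# => [(1, 'removed'), (3, 'removed'), (7, 'space'), (11, 'removed'), (13, 'removed'), (15, 'space')]
```

Note that the trailing dot in I.I.T. is not followed by an uppercase letter, so it falls through to the second alternative and becomes a space.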
To make it Unicode-aware, install the PyPI regex library (pip install regex) and use
import regex
s='U.S.A is abbr x.y is not. But I.I.T. is also valid ABBVR and so is M.Tech, X.2, X., X.*'
print(regex.sub(r'(?<=\p{Lu})(\.)(?=\p{Lu})|\.', lambda x: '' if x.group(1) else ' ', s))
The \p{Lu} construct matches any Unicode uppercase letter.
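If installing a third-party package is not an option, a rough standard-library alternative is to match every dot with re and decide in the callback, checking the neighbouring characters with unicodedata.category. This is a sketch under that assumption, not an exact equivalent of the regex library; the names is_upper and replace_dots_unicode are hypothetical.

```python
import re
import unicodedata

def is_upper(ch):
    # 'Lu' is the Unicode general category for uppercase letters.
    return unicodedata.category(ch) == 'Lu'

def replace_dots_unicode(text):
    """Hypothetical helper: remove dots that sit between two uppercase letters
    (any script), and replace every other dot with a space."""
    def repl(m):
        i = m.start()
        # A dot at the very start or end of the string has no neighbour on one side.
        if 0 < i < len(text) - 1 and is_upper(text[i - 1]) and is_upper(text[i + 1]):
            return ''
        return ' '
    return re.sub(r'\.', repl, text)

print(replace_dots_unicode('U.S.A is abbr x.y, X.2, X.*'))
# => USA is abbr x y, X 2, X *
```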