How can I remove multiple consecutive occurrences of all the special characters in a string?
I can handle individual characters with code like:
re.sub(r'\.\.+', ' ', string)
re.sub(r'@@+', ' ', string)
re.sub(r'\s\s+', ' ', string)
and, at best, loop over all the characters in a list like:
import re
from string import punctuation
for i in punctuation:
    # Escaped character twice, then '+': two or more consecutive occurrences
    to = '\\' + i + '\\' + i + '+'
    string = re.sub(to, ' ', string)
but I'm sure there is a more efficient method.
I tried:
re.sub('[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', '\n\n.AAA.x.@@+*@#=..xx000..x..\t.x..\nx*+Y.')
but it removes every run of special characters, even when the characters differ, and leaves only single ones that sit next to letters or digits.
The string can contain different consecutive special characters, like 99@aaaa*!@#$, but not repeats of the same one, like ++--...
A pattern to match all non-alphanumeric characters in Python is [\W_].
So, all you need is to wrap the pattern in a capturing group and add \1+ after it to match two or more consecutive occurrences of the same non-alphanumeric character:
text = re.sub(r'([\W_])\1+',' ',text)
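To see why the backreference matters, here is a minimal sketch comparing it with a plain quantifier. The sample string and the variable names are made up for illustration:

```python
import re

# Hypothetical sample: runs of identical characters plus one mixed run ("*!")
sample = "a..b@@@c*!d__e"

# Backreference: \1+ requires the SAME captured character to repeat
same_char = re.sub(r'([\W_])\1+', ' ', sample)

# Plain quantifier: any two or more non-alphanumerics, identical or not
any_run = re.sub(r'[\W_]{2,}', ' ', sample)

print(same_char)  # a b c*!d e
print(any_run)    # a b c d e
```

The mixed run "*!" survives only in the backreference version, which is the behavior the question asks for.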
In Python 3.x, if you wish to make the pattern ASCII-only aware, use the re.A or re.ASCII flag:
text = re.sub(r'([\W_])\1+',' ',text, flags=re.A)
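As a rough illustration of what the flag changes: by default, \W in a str pattern treats non-ASCII letters as word characters, while with re.A they count as non-word characters. The sample string below is an assumption chosen only to show the difference:

```python
import re

# Hypothetical sample: two consecutive "é" (U+00E9) and two hyphens
sample = "a\u00e9\u00e9--b"

# Default (Unicode) matching: "é" is a word character, so only "--" collapses
unicode_result = re.sub(r'([\W_])\1+', ' ', sample)

# ASCII-only matching: "é" is matched by \W, so "éé" collapses as well
ascii_result = re.sub(r'([\W_])\1+', ' ', sample, flags=re.A)

print(repr(unicode_result))  # 'aéé b'
print(repr(ascii_result))    # 'a  b'
```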
Mind the use of the r prefix that defines a raw string literal (so that you do not have to escape the \ char).
Python demo:
import re
text = "\n\n.AAA.x.@@+*@#=..xx000..x..\t.x..\nx*+Y."
print(re.sub(r'([\W_])\1+',' ',text))
Output:
.AAA.x. +*@#= xx000 x .x
x*+Y.
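Printed output hides the tab and the leading and trailing spaces, so it can help to inspect the result with repr. This is just a small check on the same input, not part of the original demo:

```python
import re

text = "\n\n.AAA.x.@@+*@#=..xx000..x..\t.x..\nx*+Y."
result = re.sub(r'([\W_])\1+', ' ', text)

# repr makes the remaining single tab and newline visible
print(repr(result))  # ' .AAA.x. +*@#= xx000 x \t.x \nx*+Y.'
```

The single tab and the single newline are kept because neither is repeated, while the doubled newline at the start becomes one space.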