I'm trying to clean data from Instagram using Python.
I need to collapse runs of duplicate letters to a single letter, except for a and g, where a run should only be reduced to two letters (aa, gg).
It should look like this:
input : mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz
output desired : mengganti, maaf, putih, meraah, maaggz
What I'm currently doing with regex is this:
re.compile(r'(.)\1{1,}', re.IGNORECASE).sub(r'\1',kalimat)
input : mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz
current output : menganti, maf, putih, merah, magz
NB: it doesn't have to use regex.
You can first match runs of a or g using a capture group, and replace each match with group 1 repeated twice.
([ag])\1+
The pattern matches:
  ([ag])   Capture group 1, match either a or g
  \1+      Repeat 1+ times the same char matched in group 1

Then match runs of any char other than a, g or a whitespace char, and replace each match with a single group 1 to remove the duplicates.
([^\sag])\1+
The pattern matches:
  (         Capture group 1
  [^\sag]   Match a non-whitespace char except for a or g
  )         Close group 1
  \1+       Repeat 1+ times the same char matched in group 1

For example:
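import re

s = "mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz"
# Step 1: reduce runs of a or g to exactly two of that letter
step1 = re.sub(r"([ag])\1+", r"\1\1", s)
# Step 2: reduce runs of any other non-whitespace char to one
print(re.sub(r"([^\sag])\1+", r"\1", step1))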
Output
mengganti, maaf, putih, meraah, maaggz
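One thing to be aware of: [^\sag] matches any non-whitespace char, so digits and punctuation are collapsed as well (1000 would become 10). If only letters should be touched, a possible variant is sketched below. The function name squeeze_letters is hypothetical, and restricting to letters is an assumption about the data, not something stated in the question.

```python
import re

def squeeze_letters(text: str) -> str:
    """Collapse letter runs: a/g down to two, any other letter down to one."""
    # Step 1: runs of a or g become exactly two of that letter.
    text = re.sub(r"([ag])\1+", r"\1\1", text)
    # Step 2: runs of any other letter become one.
    # [^\W\d_] matches a word char that is not a digit or underscore (a letter);
    # the lookahead (?![ag]) excludes a and g.
    return re.sub(r"(?![ag])([^\W\d_])\1+", r"\1", text)

print(squeeze_letters("maaagggz 1000 ppuuutttiiiihhh!!!"))
# maaggz 1000 putih!!!
```

With the original second pattern, the same input would lose the repeated zeros and exclamation marks.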
Or use a single pattern with an alternation | that combines the 2 patterns, giving 2 capture groups, and call re.sub with a lambda:
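import re

# Group 1 captures a or g; group 2 captures any other non-whitespace char
pattern = r"([ag])\1+|([^\sag])\2+"
s = "mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz"
result = re.sub(
    pattern,
    # a/g run: keep two chars; any other run: keep one
    lambda x: x.group(1) * 2 if x.group(1) else x.group(2),
    s
)
if result:
    print(result)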
Output
mengganti, maaf, putih, meraah, maaggz
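The lambda works because only one side of the alternation takes part in any given match, so the other group is None. A small sketch to inspect this with finditer (the sample string and the variable name matches are made up for illustration):

```python
import re

pattern = re.compile(r"([ag])\1+|([^\sag])\2+")

# Collect (whole match, group 1, group 2) for every run that is found.
matches = [
    (m.group(0), m.group(1), m.group(2))
    for m in pattern.finditer("maaagggz ppuu")
]

for whole, g1, g2 in matches:
    print(whole, g1, g2)
# aaa a None
# ggg g None
# pp None p
# uu None u
```

Since x.group(1) is None for the second alternative, the conditional expression falls through to x.group(2).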
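Since the question says regex is not required, the same rule could also be written with itertools.groupby, which groups consecutive equal chars. This is only a sketch: the function name squeeze_runs and the keep_two parameter are hypothetical, and it mirrors the regex above by leaving whitespace runs untouched.

```python
from itertools import groupby

def squeeze_runs(text: str, keep_two: str = "ag") -> str:
    """Collapse runs of equal chars; chars in keep_two are capped at two."""
    parts = []
    for ch, run in groupby(text):
        n = len(list(run))                 # length of this run
        if ch.isspace():
            parts.append(ch * n)           # leave whitespace as it is
        elif ch in keep_two:
            parts.append(ch * min(n, 2))   # a/g: keep at most two
        else:
            parts.append(ch)               # everything else: keep one
    return "".join(parts)

s = "mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz"
print(squeeze_runs(s))
# mengganti, maaf, putih, meraah, maaggz
```

This version makes the per-letter limit easy to change, at the cost of being slower than re.sub on long strings.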