python regex data-cleaning

How can I remove duplicate letters, with exceptions for certain letters?


I'm trying to clean data from Instagram using Python.
I need to collapse repeated letters to a single letter, except that for a and g a run should only be reduced to two letters (aa, gg).

So it should look like this:
Input: mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz
Desired output: mengganti, maaf, putih, meraah, maaggz

What I'm currently doing with regex is this:

re.compile(r'(.)\1{1,}', re.IGNORECASE).sub(r'\1',kalimat)

Input: mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz
Current output: menganti, maf, putih, merah, magz
NB: it doesn't have to use regex.


Solution

  • You can first match each run of a repeated a or g and replace the run with two copies of group 1.

    ([ag])\1+
    

    The pattern matches:

    • ([ag]) Capture group 1, match either a or g
    • \1+ Repeat 1+ times the same char matched in group 1
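
As a quick check of this first step in isolation, the sketch below applies only this pattern to the sample input from the question. The variable name step_one is just illustrative.

```python
import re

s = "mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz"

# Reduce every run of 2+ identical a or g characters to exactly two.
step_one = re.sub(r"([ag])\1+", r"\1\1", s)
print(step_one)
# mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraah, maaggz
```

Runs that are already exactly two characters long (the gg in mengganti, the aa in maaf) are matched and rewritten to themselves, so they come out unchanged.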

    Then match each run of a repeated char other than a, g, or a whitespace char, and replace the run with a single group 1 to remove the duplicates.

    ([^\sag])\1+
    

    The pattern matches:

    • ( Capture group 1
      • [^\sag] Match a non whitespace char except for a or g
    • ) Close group 1
    • \1+ Repeat 1+ times the same char matched in group 1

    For example

    import re
    
    s = "mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz"
    
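    # Inner call: reduce runs of a or g to exactly two characters.
    # Outer call: collapse runs of any other non-whitespace char to one.
    print(re.sub(
        r"([^\sag])\1+",
        r"\1",
        re.sub(r"([ag])\1+", r"\1\1", s),
    ))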
    

    Output

    mengganti, maaf, putih, meraah, maaggz
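
The question's original attempt used re.IGNORECASE, which the example above does not. If uppercase input matters, a minimal sketch along the following lines may work. The function name squeeze_letters and its keep_double parameter are hypothetical and not part of the original answer; keep_double is assumed to be a non-empty string of single characters.

```python
import re

def squeeze_letters(text, keep_double="ag"):
    """Collapse repeated characters; letters in keep_double keep two.

    Assumes keep_double is a non-empty string of single characters.
    """
    # re.escape guards against characters that are special inside [...].
    kept = re.escape(keep_double)
    # Step 1: runs of an exception letter become exactly two characters.
    text = re.sub(rf"([{kept}])\1+", r"\1\1", text, flags=re.IGNORECASE)
    # Step 2: runs of any other non-whitespace character become one.
    return re.sub(rf"([^\s{kept}])\1+", r"\1", text, flags=re.IGNORECASE)

print(squeeze_letters("MAAAF, PPUUTTIIHH, maaagggz"))
# MAAF, PUTIH, maaggz
```

Note that a mixed-case run such as Aaa becomes AA, because the replacement repeats the first character of the run.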
    



    Alternatively, combine the two patterns into a single pattern with an alternation |, which gives two capture groups, and pass a lambda to re.sub that picks the replacement depending on which group matched:

    import re
    
    pattern = r"([ag])\1+|([^\sag])\2+"
    s = "mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz"
    result = re.sub(
            pattern,
            lambda x: x.group(1) * 2 if x.group(1) else x.group(2),
            s
    )
    
    if result:
            print(result)
    

    Output

    mengganti, maaf, putih, meraah, maaggz
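
Since the question notes that a regex is not required, the same rule can also be sketched with itertools.groupby from the standard library. The function name squeeze_runs and the keep_double parameter are hypothetical names chosen for this illustration. Unlike a regex with re.IGNORECASE, the comparison here is case-sensitive.

```python
from itertools import groupby

def squeeze_runs(text, keep_double="ag"):
    """Collapse runs of identical characters without using regex."""
    parts = []
    # groupby yields one (character, run) pair per run of equal characters.
    for char, run in groupby(text):
        length = sum(1 for _ in run)
        if char in keep_double:
            # Exception letters keep at most two characters.
            parts.append(char * min(length, 2))
        elif char.isspace():
            # Whitespace is left untouched, as in the regex patterns.
            parts.append(char * length)
        else:
            # Every other character is reduced to a single occurrence.
            parts.append(char)
    return "".join(parts)

s = "mengganti, maaf, ppuuutttiiiihhh, mmmmeeeeerrrraaaah, maaagggz"
print(squeeze_runs(s))
# mengganti, maaf, putih, meraah, maaggz
```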
    
