Removing specific duplicated characters from a string in Python


How can I remove specific duplicated characters from a string in Python, but only when they appear one right after another? For example:

I have the string

string = "Hello _my name is __Alex"

I need to remove the duplicate _ only where the underscores are adjacent (__), and get a string like this:

string = "Hello _my name is _Alex"

If I use set, I get this instead:

string = "_yoiHAemnasxl"

Solution

  • (Big edit: oops, I missed that you only want to de-duplicate certain characters and not others. Retrofitting solutions...)

    I assume you have a string that represents all the characters you want to de-duplicate. Let's call it to_remove, and say that it's equal to "_.-". So only underscores, periods, and hyphens will be de-duplicated.

    You could use a regex to match multiple successive repeats of a character, and replace them with a single character.

    >>> import re
    >>> to_remove = "_.-"
    >>> s = "Hello... _my name -- is __Alex"
    >>> pattern = "(?P<char>[" + re.escape(to_remove) + "])(?P=char)+"
    >>> re.sub(pattern, r"\1", s)
    'Hello. _my name - is _Alex'
    

    Quick breakdown:

    • ?P<char> assigns the symbolic name char to the first group.
    • We put to_remove inside a character set, []. Calling re.escape is necessary because hyphens and some other characters (such as ], ^, and the backslash) would otherwise have special meaning inside the set.
    • (?P=char) refers back to the character matched by the named group "char".
    • The + matches one or more repetitions of that character.

    So in aggregate, this means "match any character from to_remove that appears more than once in a row". The second argument to sub, r"\1", then replaces that match with the first group, which is only one character long.
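
The same regex can be wrapped in a small reusable function. This is only a sketch: the name collapse_repeats and the empty-input guard are assumptions added for illustration, not part of the original answer.

```python
import re

def collapse_repeats(text, chars):
    """Collapse runs of any character in `chars` down to one occurrence.

    Hypothetical helper; the name and the guard below are illustrative.
    """
    if not chars:
        # "[]" is not a valid character set, so return the input unchanged.
        return text
    # Builds a set such as [_\.\-]; re.escape neutralises "-", "]", "^" and "\".
    pattern = re.compile("([" + re.escape(chars) + r"])\1+")
    return pattern.sub(r"\1", text)

print(collapse_repeats("Hello... _my name -- is __Alex", "_.-"))
# prints: Hello. _my name - is _Alex
```

Compiling the pattern inside the function keeps the example short; if the same to_remove is reused many times, the compiled pattern could be built once and kept.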


    Alternative approach: write a generator expression that keeps every character except those that both equal the character before them and appear in to_remove.

    >>> "".join(s[i] for i in range(len(s)) if i == 0 or not (s[i-1] == s[i] and s[i] in to_remove))
    'Hello. _my name - is _Alex'
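
If the one-liner above is hard to read, the same rule can be spelled out as an explicit loop. This is a sketch of that rule rather than a different algorithm; the function name collapse_repeats_loop is hypothetical.

```python
def collapse_repeats_loop(text, chars):
    """Keep each character unless it repeats the previous one and is in `chars`."""
    targets = set(chars)  # set membership tests are cheap
    kept = []
    previous = None
    for ch in text:
        if ch == previous and ch in targets:
            continue  # second or later copy of a targeted character: skip it
        kept.append(ch)
        previous = ch
    return "".join(kept)

print(collapse_repeats_loop("Hello... _my name -- is __Alex", "_.-"))
# prints: Hello. _my name - is _Alex
```

Note that the "ll" in "Hello" survives, because "l" is not one of the targeted characters.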
    

    Alternative approach #2: use groupby to split the string into runs of consecutive identical characters, then join the pieces back together. A run whose character is in to_remove contributes a single character; any other run is kept whole. (With no key function, groupby groups on the characters themselves.)

    >>> import itertools
    >>> "".join(k if k in to_remove else "".join(v) for k, v in itertools.groupby(s))
    'Hello. _my name - is _Alex'
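
To see what groupby is doing here, it can help to print the runs it produces for a short input. The snippet below is only an illustration; the variable names are arbitrary.

```python
from itertools import groupby

s = "is __Alex"
# Each item is (character, length of the run of that character).
runs = [(key, len(list(group))) for key, group in groupby(s)]
print(runs)
# [('i', 1), ('s', 1), (' ', 1), ('_', 2), ('A', 1), ('l', 1), ('e', 1), ('x', 1)]
```

The two underscores arrive as one run of length 2, which is why emitting just the key for that run collapses it to a single character.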
    

    Alternative approach #3: call re.sub once for each member of to_remove. This makes one pass over the string per character, so it gets expensive if to_remove contains a lot of characters.

    >>> for c in to_remove:
    ...     s = re.sub(rf"({re.escape(c)})\1+", r"\1", s)
    ...
    >>> s
    'Hello. _my name - is _Alex'
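
For the exact case in the question, where only _ needs collapsing, a single substitution is enough. A minimal sketch, assuming the underscore is the only character of interest:

```python
import re

s = "Hello _my name is __Alex"
# Two or more consecutive underscores become exactly one.
result = re.sub(r"_{2,}", "_", s)
print(result)
# prints: Hello _my name is _Alex
```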