python regex text preprocessor

Replace all consecutive repeated letters ignoring specific words


I have seen plenty of suggestions for removing consecutive repeated letters in a sentence, using either re (regex) or .join in Python, but I want to make an exception for specific words.

E.g.:

I want this sentence: sentence = 'hello, join this meeting heere using thiis lllink'

to become: 'hello, join this meeting here using this link'

given this list of words to keep as they are, skipping the repeated-letter check: keepWord = ['hello','meeting']

The two scripts I found useful are:

  • Using .join:

    import itertools
    
    sentence = ''.join(c[0] for c in itertools.groupby(sentence))
    
  • Using regex:

    import re
    
    sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)
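
For reference, here is a minimal sketch of what both snippets do to the sample sentence. The variable names via_groupby and via_regex are only illustrative. Both approaches collapse every run of identical characters, including the ones inside hello and meeting, which is why an exception list is needed.

```python
import itertools
import re

sentence = 'hello, join this meeting heere using thiis lllink'

# Collapse every run of identical characters by keeping one key per group.
via_groupby = ''.join(key for key, _ in itertools.groupby(sentence))

# Same idea with a backreference: a character followed by one or more copies of itself.
via_regex = re.sub(r'(.)\1+', r'\1', sentence)

print(via_groupby)
# => helo, join this meting here using this link
print(via_regex)
# => helo, join this meting here using this link
```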
    

I have a solution, but I think there must be a more compact and efficient one. My current solution is:

import itertools

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']

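new_words = []

for word in sentence.split():
    # Compare without surrounding punctuation so that 'hello,' matches 'hello'
    if word.strip('.,;:!?') in keepWord:
        new_words.append(word)
    else:
        # Collapse each run of identical characters to a single character
        new_words.append(''.join(c[0] for c in itertools.groupby(word)))

# Build the result from the processed words, not from the original sentence
new_sentence = ' '.join(new_words)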

Any suggestions?


Solution

  • You may match the whole words from the keepWord list, and only replace sequences of two or more identical letters in other contexts:

    import re
    sentence = 'hello, join this meeting heere using thiis lllink'
    keepWord = ['hello','meeting']
    new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
    print(new_sentence)
    # => hello, join this meeting here using this link
    

    The regex will look like

    \b(?:hello|meeting)\b|([^\W\d_])\1+
    

    If Group 1 matches, its value (a single letter) is returned; otherwise, the full match (the word to keep) is put back unchanged.
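To make the replacement logic easier to follow, the lambda can be spelled out as a named function. This is only a sketch: collapse_or_keep is a hypothetical name, and the behaviour is meant to be the same as `x.group(1) or x.group()` above.

```python
import re

sentence = 'hello, join this meeting heere using thiis lllink'
pattern = re.compile(r"\b(?:hello|meeting)\b|([^\W\d_])\1+")

def collapse_or_keep(match):
    # Hypothetical helper name, equivalent to the lambda in the answer.
    letter = match.group(1)
    if letter is not None:
        # Second alternative matched: a run of one repeated letter.
        return letter
    # First alternative matched: a protected word, returned unchanged.
    return match.group()

# Show what each match is replaced with.
for m in pattern.finditer(sentence):
    print(repr(m.group()), '->', repr(collapse_or_keep(m)))
# 'hello' -> 'hello'
# 'meeting' -> 'meeting'
# 'ee' -> 'e'
# 'ii' -> 'i'
# 'lll' -> 'l'

result = pattern.sub(collapse_or_keep, sentence)
print(result)
# => hello, join this meeting here using this link
```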

    Pattern details

    • \b(?:hello|meeting)\b - hello or meeting as a whole word (enclosed in word boundaries)
    • | - or
    • ([^\W\d_]) - Group 1: any Unicode letter
    • \1+ - one or more repetitions of the letter captured in Group 1
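
If the words to keep may contain regex metacharacters or differ in case from the text, the pattern can be built a little more defensively. The following is a sketch under those assumptions: collapse_repeats and its ignore_case parameter are hypothetical names, not part of the original answer. Note that with re.IGNORECASE the backreference also matches case-insensitively, so a run such as "Aa" would be collapsed too.

```python
import re

def collapse_repeats(text, keep_words, ignore_case=False):
    # Hypothetical helper: collapse repeated letters except inside keep_words.
    flags = re.IGNORECASE if ignore_case else 0
    repeated = r"([^\W\d_])\1+"
    if keep_words:
        # re.escape guards against regex metacharacters in the kept words.
        kept = '|'.join(re.escape(w) for w in keep_words)
        pattern = rf"\b(?:{kept})\b|{repeated}"
    else:
        # No protected words: only the repeated-letter alternative is needed.
        pattern = repeated
    # Return the single letter for a repeated run, else the kept word as is.
    return re.sub(pattern, lambda m: m.group(1) or m.group(), text, flags=flags)

sentence = 'hello, join this meeting heere using thiis lllink'
print(collapse_repeats(sentence, ['hello', 'meeting']))
# => hello, join this meeting here using this link

print(collapse_repeats('Hello, heere', ['hello']))
# => Helo, here
print(collapse_repeats('Hello, heere', ['hello'], ignore_case=True))
# => Hello, here
```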