I have seen plenty of suggestions for removing consecutively repeated letters in a sentence in Python, using either re (regex) or ''.join, but I want to make an exception for certain words.
E.g.:
I want this sentence > sentence = 'hello, join this meeting heere using thiis lllink'
to be like this > 'hello, join this meeting here using this link'
given this list of words to keep as they are, skipping the repeated-letter check: keepWord = ['hello','meeting']
The two scripts I found useful are:
Using .join:
import itertools
sentence = ''.join(c[0] for c in itertools.groupby(sentence))
Using regex:
import re
sentence = re.compile(r'(.)\1{1,}').sub(r'\1', sentence)
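For reference, here is a small sketch of what both snippets do to the example sentence when applied without any exception list. The variable names via_groupby and via_regex are only illustrative; the regex is written as a plain re.sub call with `\1+`, which is equivalent to `\1{1,}`.

```python
import itertools
import re

sentence = 'hello, join this meeting heere using thiis lllink'

# Collapse every run of identical characters, with no exceptions.
via_groupby = ''.join(key for key, _ in itertools.groupby(sentence))
via_regex = re.sub(r'(.)\1+', r'\1', sentence)

print(via_groupby)  # helo, join this meting here using this link
print(via_regex)    # helo, join this meting here using this link
```

Both approaches also shorten 'hello' and 'meeting', which is exactly what the exception list is meant to prevent.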
I have a solution, but I think there is a more compact and efficient one. My solution for now is:
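import itertools
import string

sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_words = []
for word in sentence.split():
    # Compare without surrounding punctuation so that 'hello,' still matches 'hello'
    if word.strip(string.punctuation) not in keepWord:
        # Collapse each run of identical characters to a single character
        new_words.append(''.join(c[0] for c in itertools.groupby(word)))
    else:
        new_words.append(word)
new_sentence = ' '.join(new_words)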
Any suggestions?
You can match the whole words from the keepWord list, and only replace sequences of two or more identical letters in other contexts:
import re
sentence = 'hello, join this meeting heere using thiis lllink'
keepWord = ['hello','meeting']
new_sentence = re.sub(fr"\b(?:{'|'.join(keepWord)})\b|([^\W\d_])\1+", lambda x: x.group(1) or x.group(), sentence)
print(new_sentence)
# => hello, join this meeting here using this link
The regex will look like
\b(?:hello|meeting)\b|([^\W\d_])\1+
If Group 1 matches, its value is returned; otherwise, the full match (the word to keep) is put back.
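The same idea can be sketched as a reusable function with the lambda spelled out as a named callback. The names collapse_repeats and replace are hypothetical, and the use of re.escape is an addition not in the code above; it only matters if a keep word contains regex metacharacters. The sketch assumes keep_words is non-empty.

```python
import re

def collapse_repeats(sentence, keep_words):
    # Assumes keep_words is non-empty.
    # re.escape guards against keep words containing regex metacharacters.
    keep = '|'.join(re.escape(w) for w in keep_words)
    pattern = re.compile(rf"\b(?:{keep})\b|([^\W\d_])\1+")

    def replace(match):
        # Group 1 is None when the keep-word alternative matched,
        # so the whole match (the protected word) is put back unchanged.
        if match.group(1) is None:
            return match.group()
        # Otherwise a run of repeated letters matched: keep one copy.
        return match.group(1)

    return pattern.sub(replace, sentence)

print(collapse_repeats('hello, join this meeting heere using thiis lllink',
                       ['hello', 'meeting']))
# hello, join this meeting here using this link
```

Matching is case-sensitive here, so 'Hello' would not be protected by a keep list that only contains 'hello'. The word boundaries also mean that 'meetings' is not protected by 'meeting'.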
Pattern details
- \b(?:hello|meeting)\b - hello or meeting enclosed with word boundaries
- | - or
- ([^\W\d_]) - Group 1: any Unicode letter
- \1+ - one or more backreferences to Group 1 value
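To illustrate why Group 1 uses [^\W\d_] rather than a dot, here is a small comparison on a made-up string (the sample text and variable names are assumptions for the demo). The class matches word characters that are neither digits nor underscores, so repeated digits and underscores are left alone.

```python
import re

text = 'order 1100 nooow via __init__'

# Letters only: digits and underscores are never collapsed.
letters_only = re.sub(r'([^\W\d_])\1+', r'\1', text)

# Any character: repeated digits and underscores are collapsed too.
any_char = re.sub(r'(.)\1+', r'\1', text)

print(letters_only)  # order 1100 now via __init__
print(any_char)      # order 10 now via _init_
```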