I am using the following code to clean my text:
    import re

    def clean_str(s):
        """Clean sentence"""
        s = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", s)
        s = re.sub(r"\'s", " \'s", s)
        s = re.sub(r"\'ve", " \'ve", s)
        s = re.sub(r"n\'t", " n\'t", s)
        s = re.sub(r"\'re", " \'re", s)
        s = re.sub(r"\'d", " \'d", s)
        s = re.sub(r"\'ll", " \'ll", s)
        s = re.sub(r",", " , ", s)
        s = re.sub(r"!", " ! ", s)
        s = re.sub(r"\(", " ", s)
        s = re.sub(r"\)", " ", s)
        s = re.sub(r"\?", " ? ", s)
        s = re.sub(r"\s{2,}", " ", s)
        s = re.sub(r'\S*(x{2,}|X{2,})\S*', "xxx", s)
        s = re.sub(r'[^\x00-\x7F]+', "", s)
        return s.strip()
As you can see, I am removing parentheses and other special characters. Now I want to keep the following patterns intact in my text instead of removing them:
:), :-), :( and :-(
Could anyone help me with this, please?
Thanks.
You should ask yourself which patterns match any chars from the smilies you want to "protect". You can easily see that r"[^A-Za-z0-9(),!?'`]", r"\(" and r"\)" match these chars.
So, you may fix those patterns:
s = re.sub(r":-?[()]|([^A-Za-z0-9(),!?'`])", lambda x: " " if x.group(1) else x.group(), s) # Match smilies and match and capture what you need to replace
s = re.sub(r"(?<!:)(?<!:-)\(", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds
s = re.sub(r"(?<!:)(?<!:-)\)", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds
The :-?[()]|([^A-Za-z0-9(),!?'`]) pattern either matches a smiley to protect (:-?[()] matches a :, then an optional -, and then a ( or )) or matches and captures into Group 1 any char other than those defined in the negated character class. The lambda x: " " if x.group(1) else x.group() expression implements custom replacement logic that depends on the group match: if Group 1 matched, the char is replaced with a space; otherwise, the smiley is put back where it was.
The (?<!:)(?<!:-) negative lookbehinds make sure ( and ) are not matched if they are preceded by : or :-.
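To see these two fixes in isolation, here is a minimal sketch. The helper names strip_specials and strip_parens are made up for illustration and are not part of the original function; the two parenthesis substitutions are also merged into one [()] class, which should behave the same as the two separate calls.

```python
import re

# Alternative 1 matches a smiley; alternative 2 captures a char to blank out.
PROTECT = re.compile(r":-?[()]|([^A-Za-z0-9(),!?'`])")

def strip_specials(s):
    # Hypothetical helper: Group 1 is set only when a disallowed char matched,
    # so a smiley match falls through to m.group() and is re-inserted as is.
    return PROTECT.sub(lambda m: " " if m.group(1) else m.group(), s)

def strip_parens(s):
    # Hypothetical helper: drop parentheses unless ":" or ":-" comes right before.
    return re.sub(r"(?<!:)(?<!:-)[()]", " ", s)

print(strip_specials("a;b :-( c"))
print(strip_parens("(hi) :) :-("))
```

Both lookbehinds are fixed-width, which Python's re module requires; that is why they are written as two separate assertions rather than one (?<!:-?) group.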
Note r'\S*(x{2,}|X{2,})\S*' can also match the smilies if they are glued to the xx or XX. However, fixing this one is tricky since :(-like smilies might be matched with \S* if they are not at the start of the non-whitespace chunk, so you may use
s = re.sub(r'(:-?[()])|(?:(?!:-?[()])\S)*(?:x{2,}|X{2,})(?:(?!:-?[()])\S)*', lambda x: x.group(1) if x.group(1) else "xxx", s) # Keep the captured smiley, replace any other match with xxx
The tactic is similar to the r":-?[()]|([^A-Za-z0-9(),!?'`])" pattern: we match and capture the smiley, but then we only allow matching such non-whitespace chars (\S) that do not start a smiley substring ((?:(?!:-?[()])\S)*).
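Putting it all together, the whole function might look like the sketch below. It assumes the rest of the cleaning steps stay as in the question; the only liberties taken are dropping the redundant backslashes before the quotes and merging the two parenthesis substitutions into one.

```python
import re

def clean_str(s):
    """Clean a sentence while keeping :), :-), :( and :-( intact."""
    # Protect smilies; blank out every other disallowed char (Group 1).
    s = re.sub(r":-?[()]|([^A-Za-z0-9(),!?'`])",
               lambda m: " " if m.group(1) else m.group(), s)
    s = re.sub(r"'s", " 's", s)
    s = re.sub(r"'ve", " 've", s)
    s = re.sub(r"n't", " n't", s)
    s = re.sub(r"'re", " 're", s)
    s = re.sub(r"'d", " 'd", s)
    s = re.sub(r"'ll", " 'll", s)
    s = re.sub(r",", " , ", s)
    s = re.sub(r"!", " ! ", s)
    # Remove parentheses that are not part of a smiley.
    s = re.sub(r"(?<!:)(?<!:-)[()]", " ", s)
    s = re.sub(r"\?", " ? ", s)
    s = re.sub(r"\s{2,}", " ", s)
    # Mask xx/XX chunks, but stop the chunk wherever a smiley starts.
    s = re.sub(r"(:-?[()])|(?:(?!:-?[()])\S)*(?:x{2,}|X{2,})(?:(?!:-?[()])\S)*",
               lambda m: m.group(1) or "xxx", s)
    s = re.sub(r"[^\x00-\x7F]+", "", s)
    return s.strip()

print(clean_str("I'm happy :) (really), aren't you? :-("))
print(clean_str("card 1234xxxx5678:)"))
```

Note that a smiley glued to a masked chunk stays glued to the xxx replacement, since only the chunk itself is replaced.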