python regex text special-characters emoticons

Keeping smileys/emoticons while removing special characters using regex in Python

I am using the following code to clean my text:

import re

def clean_str(s):
  """Clean sentence"""
  s = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", s)
  s = re.sub(r"\'s", " \'s", s)
  s = re.sub(r"\'ve", " \'ve", s)
  s = re.sub(r"n\'t", " n\'t", s)
  s = re.sub(r"\'re", " \'re", s)
  s = re.sub(r"\'d", " \'d", s)
  s = re.sub(r"\'ll", " \'ll", s)
  s = re.sub(r",", " , ", s)
  s = re.sub(r"!", " ! ", s)
  s = re.sub(r"\(", " ", s)
  s = re.sub(r"\)", " ", s)
  s = re.sub(r"\?", " ? ", s)
  s = re.sub(r"\s{2,}", " ", s)
  s = re.sub(r'\S*(x{2,}|X{2,})\S*',"xxx", s)
  s = re.sub(r'[^\x00-\x7F]+', "", s)
  return s.strip()

As you can see, I am removing parentheses and other special characters. Now I want to keep the following patterns intact in my text instead of removing them:

:), :-), :( and :-(

Could anyone help me with this, please?

Thanks.

Solution

Start by asking which of your patterns match any chars of the smileys you want to "protect". Three of them do: r"[^A-Za-z0-9(),!?'`]" removes the : and - chars, while r"\(" and r"\)" remove the parentheses.

So, you may fix those patterns as follows:

s = re.sub(r":-?[()]|([^A-Za-z0-9(),!?'`])", lambda x: " " if x.group(1) else x.group(), s) # Match smilies and match and capture what you need to replace
s = re.sub(r"(?<!:)(?<!:-)\(", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds
s = re.sub(r"(?<!:)(?<!:-)\)", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds

The :-?[()]|([^A-Za-z0-9(),!?'`]) pattern has two alternatives. The first one, :-?[()], matches a smiley to protect: a :, then an optional -, and then a ( or ). The second one matches and captures into Group 1 any char other than those listed in the negated character class. The lambda x: " " if x.group(1) else x.group() expression implements the custom replacement logic: if Group 1 matched, the char is replaced with a space; otherwise, the match is a smiley and it is put back where it was.
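As a minimal sketch of this "match what you protect, capture what you replace" idea, the first substitution can be tried in isolation. The names PROTECT_OR_STRIP and strip_specials are made up for this illustration and are not part of the original code.

```python
import re

# First alternative: a smiley to protect. Second alternative: any other
# disallowed char, captured into Group 1. (Names here are hypothetical.)
PROTECT_OR_STRIP = re.compile(r":-?[()]|([^A-Za-z0-9(),!?'`])")

def strip_specials(text):
    # Group 1 set -> a disallowed char, replace it with a space.
    # Group 1 empty -> the match is a smiley, return it unchanged.
    return PROTECT_OR_STRIP.sub(
        lambda m: " " if m.group(1) else m.group(), text
    )

print(repr(strip_specials("nice :-) but #fail :(")))
# -> 'nice :-) but  fail :('
```

A lone : or - that is not part of a smiley still falls through to the second alternative and is replaced, so only the four smiley forms survive.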

The (?<!:)(?<!:-) negative lookbehinds make sure ( and ) are not matched if they are immediately preceded by : or :-.
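The lookbehind behavior can be checked with a small sketch. It merges the two parenthesis patterns into a single [()] character class, which is a variation for brevity rather than the exact code above; PLAIN_PAREN and drop_plain_parens are hypothetical names.

```python
import re

# A parenthesis that is NOT immediately preceded by ":" or ":-".
# (Hypothetical helper names, used only for this illustration.)
PLAIN_PAREN = re.compile(r"(?<!:)(?<!:-)[()]")

def drop_plain_parens(text):
    # Ordinary parentheses become spaces; smiley parentheses are skipped.
    return PLAIN_PAREN.sub(" ", text)

print(repr(drop_plain_parens("(hi) :) :-(")))
# -> ' hi  :) :-('
```

Two separate lookbehinds are used because Python's re module requires each lookbehind to have a fixed width, so the optional hyphen cannot be written as (?<!:-?).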

Note that r'\S*(x{2,}|X{2,})\S*' can also match the smileys if they are glued to the xx or XX. Fixing this one is trickier: a plain \S* would still consume a smiley like :( whenever it is not at the start of the non-whitespace chunk, so you may use

s = re.sub(r'(:-?[()])|(?:(?!:-?[()])\S)*(?:x{2,}|X{2,})(?:(?!:-?[()])\S)*', lambda x: x.group(1) if x.group(1) else "xxx", s) # Keep a captured smiley, replace any other match with xxx

The tactic is similar to the r":-?[()]|([^A-Za-z0-9(),!?'`])" pattern: we match and capture the smiley in the first alternative, and in the second one we only let \S match chars that do not start a smiley substring, which is what (?:(?!:-?[()])\S)* does.
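Putting the pieces together, one possible version of the whole function is sketched below. It is not a drop-in replacement: the contraction and punctuation steps are condensed into a loop and a single substitution, the SMILEY constant is an assumed helper, and the behavior is only intended to match the original steps plus the smiley protection.

```python
import re

SMILEY = r":-?[()]"  # assumed helper constant: :), :-), :( and :-(

def clean_str(s):
    """Clean a sentence while keeping :), :-), :( and :-( intact."""
    # Replace disallowed chars with a space, but put smileys back as-is.
    s = re.sub(SMILEY + r"|([^A-Za-z0-9(),!?'`])",
               lambda m: " " if m.group(1) else m.group(), s)
    # Split off contractions, in the same order as the original code.
    for suffix in ("'s", "'ve", "n't", "'re", "'d", "'ll"):
        s = s.replace(suffix, " " + suffix)
    # Pad , ! and ? with spaces.
    s = re.sub(r"([,!?])", r" \1 ", s)
    # Drop parentheses unless they belong to a smiley.
    s = re.sub(r"(?<!:)(?<!:-)[()]", " ", s)
    s = re.sub(r"\s{2,}", " ", s)
    # Mask chunks containing xx/XX; a captured smiley is returned unchanged.
    s = re.sub(
        r"(" + SMILEY + r")|(?:(?!" + SMILEY + r")\S)*(?:x{2,}|X{2,})"
        r"(?:(?!" + SMILEY + r")\S)*",
        lambda m: m.group(1) or "xxx", s)
    s = re.sub(r"[^\x00-\x7F]+", "", s)
    return s.strip()

print(clean_str("I can't wait (really) :-) #yay!"))
# -> I ca n't wait really :-) yay !
print(clean_str("call xx123:( now"))
# -> call xxx:( now
```

Building every pattern from the same SMILEY string keeps the protected set in one place, so adding another emoticon later would mean changing one constant and the fixed-width lookbehinds.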