python regex text special-characters emoticons

Keeping smileys/emoticons while removing special characters using regex in Python


I am using the following code to clean my text:

import re

def clean_str(s):
  """Clean sentence"""
  # Replace every character outside the allowed set with a space
  s = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", s)
  s = re.sub(r"\'s", " \'s", s)
  s = re.sub(r"\'ve", " \'ve", s)
  s = re.sub(r"n\'t", " n\'t", s)
  s = re.sub(r"\'re", " \'re", s)
  s = re.sub(r"\'d", " \'d", s)
  s = re.sub(r"\'ll", " \'ll", s)
  s = re.sub(r",", " , ", s)
  s = re.sub(r"!", " ! ", s)
  s = re.sub(r"\(", " ", s)
  s = re.sub(r"\)", " ", s)
  s = re.sub(r"\?", " ? ", s)
  s = re.sub(r"\s{2,}", " ", s)
  s = re.sub(r'\S*(x{2,}|X{2,})\S*',"xxx", s)
  s = re.sub(r'[^\x00-\x7F]+', "", s)
  return s.strip()

As you can see, I am removing parentheses and other special characters. Now I want to keep the following patterns intact in my text instead of removing them:

:), :-), :( and :-(

Could anyone help me with this please?

thanks,


Solution

  • Ask yourself which patterns match chars that appear in the smilies you want to "protect". Here, r"[^A-Za-z0-9(),!?'`]" matches : and -, while r"\(" and r"\)" match the parentheses.

    So, you may fix those patterns:

    s = re.sub(r":-?[()]|([^A-Za-z0-9(),!?'`])", lambda x: " " if x.group(1) else x.group(), s) # Match smilies and match and capture what you need to replace
    s = re.sub(r"(?<!:)(?<!:-)\(", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds
    s = re.sub(r"(?<!:)(?<!:-)\)", " ", s) # Prepend (?<!:)(?<!:-) lookbehinds
    

    The :-?[()]|([^A-Za-z0-9(),!?'`]) pattern either matches a smiley to protect (:-?[()] matches a :, then an optional -, then a ( or )) or matches and captures into Group 1 any char not listed in the negated character class. The lambda x: " " if x.group(1) else x.group() expression implements the replacement logic: if Group 1 matched, the char is replaced with a space; otherwise, the smiley is put back where it was.

    The (?<!:)(?<!:-) negative lookbehinds make sure ( and ) are not matched if they are preceded by : or :-.

    Note that r'\S*(x{2,}|X{2,})\S*' can also match the smilies if they are glued to the xx or XX. Fixing this one is trickier, since smilies like :( can be consumed by \S* whenever they are not at the start of the non-whitespace chunk. You may use

    s = re.sub(r'(:-?[()])|(?:(?!:-?[()])\S)*(?:x{2,}|X{2,})(?:(?!:-?[()])\S)*', lambda x: x.group(1) if x.group(1) else "xxx", s) # Keep a captured smiley, replace any other match with xxx
    

    The tactic is similar to the r":-?[()]|([^A-Za-z0-9(),!?'`])" pattern: we match and capture the smiley first, and in the other alternative we only allow non-whitespace chars (\S) that do not start a smiley substring ((?:(?!:-?[()])\S)*).
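
Putting the pieces together, here is a minimal sketch of how the whole function might look with the changes above applied. It assumes the substitutions stay in the same order as in the question. The xx replacement uses a lambda that keeps the captured smiley and returns "xxx" for any other match, which is my reading of the intent. Treat it as an illustration rather than a definitive implementation.

```python
import re

def clean_str(s):
    """Clean a sentence while keeping :), :-), :( and :-( intact."""
    # Keep smileys as they are; replace any other disallowed char with a space.
    s = re.sub(r":-?[()]|([^A-Za-z0-9(),!?'`])",
               lambda x: " " if x.group(1) else x.group(), s)
    # Split common English contractions off the preceding word.
    s = re.sub(r"'s", " 's", s)
    s = re.sub(r"'ve", " 've", s)
    s = re.sub(r"n't", " n't", s)
    s = re.sub(r"'re", " 're", s)
    s = re.sub(r"'d", " 'd", s)
    s = re.sub(r"'ll", " 'll", s)
    s = re.sub(r",", " , ", s)
    s = re.sub(r"!", " ! ", s)
    # Remove parentheses unless they are preceded by ":" or ":-".
    s = re.sub(r"(?<!:)(?<!:-)\(", " ", s)
    s = re.sub(r"(?<!:)(?<!:-)\)", " ", s)
    s = re.sub(r"\?", " ? ", s)
    s = re.sub(r"\s{2,}", " ", s)
    # Replace chunks containing xx/XX with "xxx", but never swallow a smiley.
    s = re.sub(r"(:-?[()])|(?:(?!:-?[()])\S)*(?:x{2,}|X{2,})(?:(?!:-?[()])\S)*",
               lambda x: x.group(1) if x.group(1) else "xxx", s)
    # Drop any remaining non-ASCII characters.
    s = re.sub(r"[^\x00-\x7F]+", "", s)
    return s.strip()

print(clean_str("I love it :) (really)!"))  # I love it :) really !
print(clean_str("sad :-( about xxx123"))    # sad :-( about xxx
print(clean_str("fooxx:)"))                 # xxx:)
```

Ordinary parentheses are still removed, a lone colon (as in "a:b") is still replaced with a space, and a smiley glued to an xx chunk survives the "xxx" replacement.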