python-3.x gmail-api

Remove non-alphanumeric characters but preserve punctuation


I am calling the Gmail API to get the titles (subject lines) of emails. Some of the titles contain non-alphanumeric characters such as emojis, the "'" sign, and so on (example: '\u201cEthnographic'). At the same time, I need to preserve the punctuation at the end of words: for example, Hello! needs to stay as it is. I've seen many code samples that strip non-alphanumeric characters, but I haven't been able to accomplish what I'm trying to do. Any feedback is appreciated.

import json

# Call the API and get the emails (the call that produces `message` is not shown)
M = json.dumps(message)

temp = message['messages'][0]['payload']

num_found = 0
# Get the subject of the email
subject = None  # stays None if there is no Subject header
for header in temp['headers']:
    # print(header['name'])
    if header['name'] == 'Subject':
        subject = header['value']
        break

# S contains patterns like "\u201cEthnographic ..."
# or "\u2b50\ufe0f best of .."
S = json.dumps(subject)
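
A note on the snippet above: the `\u201c`-style sequences appear to come from `json.dumps` itself, which by default (`ensure_ascii=True`) escapes every non-ASCII character. The header value is already an ordinary Python string. The sketch below illustrates this with a made-up subject line standing in for a real Gmail response:

```python
import json

# Hypothetical subject line standing in for header['value'] from the API.
subject = "\u201cEthnographic\u201d study \u2b50\ufe0f Hello!"

# Default behaviour: every non-ASCII character is written as a \uXXXX escape.
escaped = json.dumps(subject)

# With ensure_ascii=False the characters are kept as they are.
unescaped = json.dumps(subject, ensure_ascii=False)

print(escaped)
print(unescaped)

# The escapes are only a serialization detail; they round-trip to the same string.
print(json.loads(escaped) == subject)
```

So any cleaning is probably best done on `subject` directly, before (or instead of) calling `json.dumps`.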

Solution

  • Have you looked at the emoji Python package?

    ref: emoji package documentation

    import emoji
    
    def emoji_free(text):
        # Drop every whitespace-separated word that contains an emoji character.
        # Note: emoji.UNICODE_EMOJI was removed in emoji 2.0; this version assumes
        # emoji >= 2.0, where emoji.EMOJI_DATA is the lookup table.
        words = [word for word in text.split()
                 if not any(ch in emoji.EMOJI_DATA for ch in word)]
        return ' '.join(words)
    
    emoji_message = 'This is an emoji 🙃 and the code is designed to remove 😎 emojis from a string.'
    
    # remove the emojis from the message
    clean_message = emoji_free(emoji_message)
    
    print(clean_message)
    # output
    # This is an emoji and the code is designed to remove emojis from a string.
    
    emoji_message = 'You are a bright \u2b50 with a smiling face \u263A'
    print(emoji_message)
    # output 
    # You are a bright ⭐ with a smiling face ☺
    
    clean_message = emoji_free(emoji_message)
    print(clean_message)
    # output 
    # You are a bright with a smiling face
    

    Another way is to strip the emoji code points with a regular expression.

    import re
    
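    # Character class of emoji-related code points. Some ranges are very broad:
    # 24C2-1F251 also covers CJK and other non-emoji characters, and
    # 10000-10FFFF covers every supplementary plane.
    # (The u"" prefix and re.UNICODE flag are not needed in Python 3.)
    emoji_pattern = re.compile("["
                "\U0001F600-\U0001F64F"  # emoticons
                "\U0001F300-\U0001F5FF"  # symbols & pictographs
                "\U0001F680-\U0001F6FF"  # transport & map symbols
                "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                "\U00002702-\U000027B0"
                "\U000024C2-\U0001F251"
                "\U0001f926-\U0001f937"
                "\U00010000-\U0010ffff"
                "\u200d"
                "\u2640-\u2642"
                "\u2600-\u2B55"
                "\u23cf"
                "\u23e9"
                "\u231a"
                "\u3030"
                "\ufe0f"
    "]+")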
    
    # message with star emoji in unicode
    emoji_message = 'You are a bright \u2b50'
    
    # print message with star emoji
    print(emoji_message)
    # output 
    # You are a bright ⭐
    
    # print message without star emoji
    print(emoji_pattern.sub(r'', emoji_message)) 
    # output 
    # You are a bright
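
Neither snippet above touches characters like the curly quote `\u201c` from the question, so here is a minimal standard-library sketch of a whitelist approach instead: keep letters, digits, whitespace, and punctuation, and drop everything else. It assumes "alphanumeric" means ASCII only (accented letters would be removed as well); the function name `keep_text_and_punctuation` and the sample subject are made up for illustration.

```python
import re
import string

# Anything that is NOT an ASCII letter, digit, whitespace, or ASCII punctuation.
# Assumption: ASCII-only output is acceptable; non-ASCII letters are dropped too.
_DROP = re.compile("[^A-Za-z0-9\\s" + re.escape(string.punctuation) + "]+")


def keep_text_and_punctuation(text):
    """Remove emoji, curly quotes, and other non-ASCII symbols; keep punctuation."""
    cleaned = _DROP.sub("", text)
    # Collapse the gaps left behind where a standalone symbol was removed.
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical subject line combining the patterns mentioned in the question.
subject = "\u201cEthnographic\u201d notes \u2b50\ufe0f best of 2020. Hello!"
print(keep_text_and_punctuation(subject))
# Ethnographic notes best of 2020. Hello!
```

If non-ASCII letters need to survive, the character class would have to be widened (for example with `\w`), at the cost of letting through more symbols than a strict ASCII whitelist.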