
Regex - Match certain patterns while excluding others?


I have text data that I want to clean with Python, i.e. keep only alphanumeric characters. However, most of the text I encounter contains emoji. I want to strip the non-alphanumeric characters but still keep the emoji.

First, I used the emoji library in Python to convert each emoji in a text to a string pattern that makes it distinguishable. An example of an emoji that has been "demojized" (demojize is a literal function in the library) is shown below:

':smiley_face:' # a "demojized" emoji.

After scrolling through the data, I find that these emojis (once "demojized") exhibit the same pattern, which in regex terms seems to be

':[a-z_]+:' # regex for matching emojis.
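For illustration, here is a minimal sketch of the extraction step with that pattern. The sample string and the names EMOJI_NAME and sample are made up for this example; only the regex itself comes from the text above.

```python
import re

# Pattern from above: a colon, one or more lowercase letters or underscores, a colon.
EMOJI_NAME = re.compile(r':[a-z_]+:')

# Hypothetical sample string, used only for illustration.
sample = 'Wow.. :smiley_face: this is delicious! :thumbs_up:'

# findall returns every non-overlapping match, left to right.
found = EMOJI_NAME.findall(sample)
print(found)
# => [':smiley_face:', ':thumbs_up:']
```

Note that this pattern would also match ordinary text that happens to look like :word:, so it is only as reliable as the assumption that such text does not occur in the data.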

Ok, so I know the pattern for emojis and I can extract every emoji from the text data I have. The problem is that I want to remove the non-alphanumerics without also altering the emoji pattern. My initial attempt to clean the data:

>>> import re
>>> text = 'Wow.. :smiley_face: this is delicious!' # A string containing emoji
>>> cleaned_text = re.sub('[^a-zA-Z0-9]+', ' ', text) # regex to keep only alphanumerics
>>> print(cleaned_text)
Wow smiley face this is delicious

Clearly this isn't my desired output. I want to keep the emoji text intact, as shown below:

'Wow :smiley_face: this is delicious' # Desired output

So far I have looked into things like lookahead assertions, but to no avail. Is it possible with regex to remove non-alphanumerics while excluding the ':[a-z_]+:' pattern from the match? Apologies if the question is unclear.


Solution

  • If you just want to remove all special chars except the colons and underscores inside colon-word(s)-colon contexts, you can use

    re.sub(r'(:[a-z_]+:)|[^\w\s]|_', r'\1', text)
    

    Details:

    • (:[a-z_]+:) - Capturing group 1 (\1): :, one or more lowercase ASCII letters or _, and a :
    • | - or
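    • [^\w\s]|_ - any char other than a word or whitespace char, or a _ (since _ is a word char, [^\w\s] does not match it, hence it needs to be added as a separate alternative).
    • \1 (in the replacement) - puts the Group 1 value back when the emoji pattern matched; when one of the other alternatives matched, Group 1 did not participate, so the match is replaced with an empty string (this behavior for unmatched groups requires Python 3.5+).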
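To see why the extra |_ alternative matters, here is a small comparison sketch. The input string and the variable names are hypothetical, chosen so that an underscore appears both inside and outside an emoji name.

```python
import re

# Hypothetical input: one underscore outside an emoji name, one inside.
text = 'snake_case, :smiley_face: and more!'

# Without `|_`: the stray underscore survives, because `_` is a word
# character and [^\w\s] never matches it.
without_alt = re.sub(r'(:[a-z_]+:)|[^\w\s]', r'\1', text)

# With `|_`: the stray underscore is removed, while the one inside the
# emoji name is protected by the capturing group that is tried first.
with_alt = re.sub(r'(:[a-z_]+:)|[^\w\s]|_', r'\1', text)

print(without_alt)  # => snake_case :smiley_face: and more
print(with_alt)     # => snakecase :smiley_face: and more
```

The order of the alternatives is what protects the emoji: at each position the regex engine tries the capturing group first, so the colons and underscores of a full :name: are consumed before the removal alternatives get a chance.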

    See the Python demo:

    import re
    text = 'Wow.. :smiley_face: this is delicious!' # A string containing emoji
    print( re.sub(r'(:[a-z_]+:)|[^\w\s]|_', r'\1', text) )
    # => Wow :smiley_face: this is delicious
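If it helps, the same regex can be wrapped in a small helper. This is only a sketch under a few assumptions that are not part of the answer above: the function name clean_keep_emoji and the ascii_only parameter are made up, removed characters are replaced with a space (as in the question's original attempt) rather than deleted, and runs of whitespace are collapsed afterwards.

```python
import re

def clean_keep_emoji(text, ascii_only=False):
    """Hypothetical helper: drop special chars but keep :emoji_name: tokens."""
    # With re.ASCII, \w only covers [a-zA-Z0-9_], which is closer to the
    # [a-zA-Z0-9] class in the question; by default \w also matches
    # non-ASCII letters and digits.
    flags = re.ASCII if ascii_only else 0
    pattern = re.compile(r'(:[a-z_]+:)|[^\w\s]|_', flags)
    # Keep the emoji token if group 1 matched, else substitute a space so
    # that words separated only by punctuation do not get glued together.
    spaced = pattern.sub(lambda m: m.group(1) or ' ', text)
    # Collapse the whitespace runs left behind and trim both ends.
    return ' '.join(spaced.split())

print(clean_keep_emoji('Wow.. :smiley_face: this is delicious!'))
# => Wow :smiley_face: this is delicious
print(clean_keep_emoji('So good - 10/10 :thumbs_up: !!'))
# => So good 10 10 :thumbs_up:
print(clean_keep_emoji('caf\u00e9 :hot_beverage:', ascii_only=True))
# => caf :hot_beverage:
```

Whether to substitute a space or nothing is a design choice: substituting nothing, as in the one-liner above, turns 10/10 into 1010, while substituting a space keeps the two numbers apart.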