python regex data-processing emoticons unix-text-processing

unbalanced parenthesis regex

!pip install emot
import re
from emot.emo_unicode import EMOTICONS_EMO
def convert_emoticons(text):
    for emot in EMOTICONS_EMO:
        text = re.sub(u'\('+emot+'\)', "_".join(EMOTICONS_EMO[emot].replace(",","").split()), text)
    return text

text = "Hello :-) :-)"
convert_emoticons(text)

I'm trying to run the above code in Google Colab, but it gives the following error: unbalanced parenthesis at position 4

My understanding from the re module documentation is that '\(any_expression\)' is the correct form, but I still get the error. So I have tried replacing '\(' + emot + '\)' with:

'(' + emot + ')', it gives the same error
'[' + emot + ']', it gives the following output: Hello Happy_face_or_smiley-Happy_face_or_smiley Happy_face_or_smiley-Happy_face_or_smiley

The correct output should be Hello Happy_face_smiley Happy_face_smiley for text = "Hello :-) :-)"

Can someone help me fix the problem?

Solution

This is pretty tricky with regex, because you'd first need to escape the regex metacharacters contained in the emoticons themselves, such as the parentheses in :) and :(. Those unescaped parentheses are why you get the unbalanced parenthesis error. So you'd need to do something like this first:

>>> print(re.sub(r'([()...])', r'%s\1' % '\\\\', ':)'))
:\)
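
The standard library's re.escape does the same job without a hand-maintained character class: it backslash-escapes every regex metacharacter in a string. A minimal sketch, using a sample emoticon and label typed in by hand rather than read from the emot mapping:

```python
import re

emoticon = ":-)"                 # sample input, not taken from emot
pattern = re.escape(emoticon)    # escapes ")" and any other metacharacters

# The escaped pattern compiles and matches the emoticon literally.
result = re.sub(pattern, "Happy_face_smiley", "Hello :-) :-)")
print(result)
# Hello Happy_face_smiley Happy_face_smiley
```

If you want to stay with re.sub, wrapping each key in re.escape before building the pattern should avoid the unbalanced parenthesis error.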

But I'd suggest just doing a straight string replacement, since you already have a mapping that you're iterating over. So we'd have:

from emot.emo_unicode import EMOTICONS_EMO
def convert_emoticons(text):
    for emot in EMOTICONS_EMO:
        text = text.replace(emot, EMOTICONS_EMO[emot].replace(" ","_"))
    return text


text = "Hello :-) :-)"
convert_emoticons(text)
# 'Hello Happy_face_smiley Happy_face_smiley'
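
One caveat with looping over str.replace: if one emoticon is a prefix of another, the result depends on iteration order, because the shorter one may be replaced first. A possible workaround is to build a single alternation from re.escape-d keys, longest first, and substitute in one pass. This is only a sketch; SAMPLE_EMOTICONS and convert_emoticons_single_pass are made-up names, and the small dict stands in for EMOTICONS_EMO:

```python
import re

# Hypothetical stand-in for emot's EMOTICONS_EMO; the real mapping is much larger.
SAMPLE_EMOTICONS = {
    ":-)": "Happy face smiley",
    ":-))": "Very happy",
    ":-(": "Frown sad",
}

# Longest keys first so ":-))" is tried before its prefix ":-)".
_keys = sorted(SAMPLE_EMOTICONS, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(k) for k in _keys))

def convert_emoticons_single_pass(text):
    """Replace each emoticon with its underscore-joined label in one pass."""
    return _pattern.sub(
        lambda m: SAMPLE_EMOTICONS[m.group(0)].replace(" ", "_"), text
    )

print(convert_emoticons_single_pass("Hello :-) :-)) :-("))
# Hello Happy_face_smiley Very_happy Frown_sad
```

Because the text is scanned once, replacement labels are never re-scanned for emoticons either, which the repeated str.replace loop cannot guarantee.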