
Replace emoji with other text


I need to replace every emoji in a text with the form [<emoji>](emoji/<id>), for example [🔤](emoji/1234567890). I wrote this code:

entities = [. . .] # ids for my emojis

emoji_pattern = re.compile(r"[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]")
emojis = [match.group() for match in re.finditer(emoji_pattern, text)]
emoji_dict = {emoji: [] for emoji in set(emojis)}
for i, emoji in enumerate(emojis):
    emoji_dict[emoji].append(i)
new_text = replace_emoji(emoji_dict, entities, text)


def replace_emoji(emoji_dict, entities, text):
    for emoji, indices in emoji_dict.items():
        for index in indices:
            text = re.sub(fr"{emoji}", f"[{emoji}](emoji/{entities[index]})", text)
    return text

emoji_dict looks something like this: {'🔤': [0], '🔹': [1, 2, 3, 4, 5]}, where the numbers are indices into the entities list.

If an emoji occurs in the text only once (like 🔤), the result is correct: [🔤](emoji/1234567890). But if an emoji occurs several times (like 🔹), the replacements end up nested inside each other: [[🔹](emoji/5235873473821159415)](emoji/5235851187235861094)[[🔹](emoji/5235873473821159415)](emoji/5235851187235861094)

How can I fix this?

Example:

example text

text = '''Hello, #️⃣ user #️⃣ How's your day going? 😄 I hope everything is going great for you! 👌 If you have any questions, feel free to ask. I'm here to help! 🫰'''

. . .

new_text = '''Hello, [#️⃣](emoji/12352352340) user [#️⃣](emoji/12352352340) How's your day going? [😄](emoji/1245531421) I hope everything is going great for you! [👌](emoji/523424120) If you have any questions, feel free to ask. I'm here to help! [🫰](emoji/90752893562)'''

Solution

  • When you do

    for index in indices:
        text = re.sub(..., text)
    

    re.sub() replaces every occurrence of the emoji, not just one. The first iteration turns each occurrence into f'[{emoji}](emoji/{entities[indices[0]]})'. The second iteration then matches the emoji inside those brackets and wraps it again with f'[{emoji}](emoji/{entities[indices[1]]})', and so on, so you get a series of nested replacements. You don't want to replace inside a previous replacement.

    In your desired output, every repetition of an emoji uses the same entity. So there's no need to keep a list of indices for each emoji, or to loop over them when making the replacements. emoji_dict only needs one index per emoji, and a single re.sub() call per emoji replaces all of its occurrences with the corresponding entity.

    import re

    text = "He😄 llo, #️⃣ user #️⃣ How's your day going? 😄 I hope everything is going great for you! 👌 If you have any questions, feel free to ask. I'm here to help! 🫰"
    entities = [12345, 67890, 23456, 78901]  # ids for my emojis

    def replace_emoji(emoji_dict, entities, text):
        # One re.sub() call per distinct emoji, so replacements never nest.
        for emoji, index in emoji_dict.items():
            # re.escape() keeps the emoji from being read as regex syntax.
            text = re.sub(re.escape(emoji), f"[{emoji}](emoji/{entities[index]})", text)
        return text

    emoji_pattern = re.compile(r"[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]")
    emojis = emoji_pattern.findall(text)
    # dict.fromkeys() drops duplicates but keeps first-occurrence order, so each
    # emoji gets the same index on every run (iterating a set has arbitrary order).
    emoji_dict = {emoji: i for i, emoji in enumerate(dict.fromkeys(emojis))}
    new_text = replace_emoji(emoji_dict, entities, text)

    print(new_text)


    output:

    He[😄](emoji/12345) llo, #️⃣ user #️⃣ How's your day going? [😄](emoji/12345) I hope everything is going great for you! [👌](emoji/67890) If you have any questions, feel free to ask. I'm here to help! 🫰
    

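    #️⃣ and 🫰 are not replaced because the regexp doesn't match them: 🫰 (U+1FAF0) lies outside the listed ranges, and #️⃣ is a keycap sequence (# followed by U+FE0F and U+20E3), none of which is in the character class.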
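
If you would rather not loop over the emojis at all, one possible alternative is a single re.sub() pass with a function as the replacement. Each match is visited exactly once and the inserted text is never rescanned, so nesting cannot happen. The sketch below is only an illustration under some assumptions: the names EMOJI_PATTERN, link_emojis and link_emojis_in_order are made up for this example, and the pattern is a widened guess (keycap sequences plus the U+1F900-U+1FAFF blocks), not a complete emoji matcher.

```python
import re

# Assumption: a wider pattern than the one in the question. It adds keycap
# sequences (one of # * 0-9, an optional U+FE0F, then U+20E3) and the
# U+1F900-U+1FAFF blocks, which contain U+1FAF0. Still not a full emoji matcher.
EMOJI_PATTERN = re.compile(
    r"[#*0-9]\uFE0F?\u20E3"
    r"|[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\U0001F900-\U0001FAFF"
    r"\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]"
)


def link_emojis(text, ids_by_emoji):
    """Wrap each matched emoji as [emoji](emoji/<id>) in a single pass."""
    def wrap(match):
        emoji = match.group()
        entity = ids_by_emoji.get(emoji)
        # Emojis without a known id are left untouched.
        return emoji if entity is None else f"[{emoji}](emoji/{entity})"
    return EMOJI_PATTERN.sub(wrap, text)


def link_emojis_in_order(text, entities):
    """Variant: the n-th matched emoji gets entities[n].

    Raises StopIteration if there are more matches than ids.
    """
    ids = iter(entities)
    return EMOJI_PATTERN.sub(lambda m: f"[{m.group()}](emoji/{next(ids)})", text)


sample = "Hi 😄 and 😄, ok 👌"
print(link_emojis(sample, {"😄": 1245531421, "👌": 523424120}))
# prints: Hi [😄](emoji/1245531421) and [😄](emoji/1245531421), ok [👌](emoji/523424120)
print(link_emojis_in_order(sample, [111, 222, 333]))
# prints: Hi [😄](emoji/111) and [😄](emoji/222), ok [👌](emoji/333)
```

link_emojis() fits the desired output in the question, where every repetition of an emoji shares one id. link_emojis_in_order() fits the case where entities holds one id per occurrence, which is what the original emoji_dict with its lists of indices seemed to be aiming for.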