I need to replace every emoji in a text with the form ["emoji here"](emoji/1234567890). I wrote this code:
entities = [. . .]  # ids for my emojis
emoji_pattern = re.compile(r"[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]")
emojis = [match.group() for match in re.finditer(emoji_pattern, text)]
emoji_dict = {emoji: [] for emoji in set(emojis)}
for i, emoji in enumerate(emojis):
    emoji_dict[emoji].append(i)
new_text = replace_emoji(emoji_dict, entities, text)
def replace_emoji(emoji_dict, entities, text):
    for emoji, indices in emoji_dict.items():
        for index in indices:
            text = re.sub(fr"{emoji}", f"[{emoji}](emoji/{entities[index]})", text)
    return text
emoji_dict looks something like this: {'🔤': [0], '🔹': [1, 2, 3, 4, 5]}, where the numbers are indices into the entities list.
If an emoji occurs in the text only once (as in the case of 🔤), then everything is displayed correctly: [🔤](emoji/1234567890). But if an emoji occurs several times (as in the case of 🔹), the replacements end up nested inside each other: [[🔹](emoji/5235873473821159415)](emoji/5235851187235861094)[[🔹](emoji/5235873473821159415)](emoji/5235851187235861094)
How can I fix this?
Example:
text = '''Hello, #️⃣ user #️⃣ How's your day going? 😄 I hope everything is going great for you! 👌 If you have any questions, feel free to ask. I'm here to help! 🫰'''
. . .
new_text = '''Hello, [#️⃣](emoji/12352352340) user [#️⃣](emoji/12352352340) How's your day going? [😄](emoji/1245531421) I hope everything is going great for you! [👌](emoji/523424120) If you have any questions, feel free to ask. I'm here to help! [🫰](emoji/90752893562)'''
When you do
for index in indices:
    text = re.sub(..., text)
the first iteration replaces every occurrence of the emoji with f'[{emoji}](emoji/{entities[indices[0]]})'. The second iteration then replaces the emoji inside the [] with f'[{emoji}](emoji/{entities[indices[1]]})', and so on, so you get a series of nested replacements. You don't want to replace inside a previous replacement.
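The nesting can be reproduced in isolation. This is a minimal sketch, assuming a shortened text and two made-up ids (111 and 222 are placeholders, not real entity ids):

```python
import re

text = "a 🔹 b 🔹"
ids = [111, 222]  # hypothetical ids, for illustration only

# Looping re.sub over the ids: every pass re-matches the emoji that the
# previous pass left inside the square brackets.
nested = text
for entity_id in ids:
    nested = re.sub("🔹", f"[🔹](emoji/{entity_id})", nested)
print(nested)  # a [[🔹](emoji/222)](emoji/111) b [[🔹](emoji/222)](emoji/111)

# A single re.sub call already replaces all occurrences, with no nesting.
once = re.sub("🔹", f"[🔹](emoji/{ids[0]})", text)
print(once)  # a [🔹](emoji/111) b [🔹](emoji/111)
```

The id used by the last pass ends up innermost, which matches the shape of the output shown in the question.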
In your desired output, you use the same entity for all the repetitions of an emoji. So there's no need to make a list of indices for each emoji, or loop over them when making the replacements. emoji_dict should just have one index for each emoji, and a single re.sub call per emoji then replaces all of its occurrences with the corresponding entity.
import re
text = "He😄 llo, #️⃣ user #️⃣ How's your day going? 😄 I hope everything is going great for you! 👌 If you have any questions, feel free to ask. I'm here to help! 🫰"
entities = [12345, 67890, 23456, 78901]  # ids for my emojis

def replace_emoji(emoji_dict, entities, text):
    # One re.sub call per distinct emoji replaces all of its occurrences at once.
    for emoji, index in emoji_dict.items():
        text = re.sub(re.escape(emoji), f"[{emoji}](emoji/{entities[index]})", text)
    return text
emoji_pattern = re.compile(r"[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]")
emojis = re.findall(emoji_pattern, text)
emoji_dict = {}
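# Give each distinct emoji one index. Set order is arbitrary, so which id an emoji gets may vary between runs.
for i, emoji in enumerate(set(emojis)):
    emoji_dict[emoji] = i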
new_text = replace_emoji(emoji_dict, entities, text)
print(new_text)
output:
He[😄](emoji/67890) llo, #️⃣ user #️⃣ How's your day going? [😄](emoji/67890) I hope everything is going great for you! [👌](emoji/12345) If you have any questions, feel free to ask. I'm here to help! 🫰
#️⃣ and 🫰 are not replaced because they aren't matched by the regexp.
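If those two should be linked as well, one option is to widen the pattern and do all replacements in a single pass with a callback. The sketch below rests on assumptions that are not in the original code: the added range U+1FA70 to U+1FAFF (which contains 🫰), the keycap alternative for sequences like #️⃣, and the helper name link_emojis are all illustrative. Ids are handed out in first-seen order, so the result is deterministic.

```python
import re

# Assumed extension of the original character class: keycap sequences
# (base character, optional U+FE0F, then U+20E3) and U+1FA70-U+1FAFF.
EMOJI_PATTERN = re.compile(
    r"[#*0-9]\uFE0F?\u20E3"
    r"|[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\U0001FA70-\U0001FAFF"
    r"\u2702-\u27B0\u27BF-\u27FF\u2930-\u293F\u2980-\u29FF]"
)

def link_emojis(text, entities):
    """Wrap each emoji as [emoji](emoji/<id>), using one id per distinct emoji."""
    ids = {}  # emoji -> entity id, filled in first-seen order

    def wrap(match):
        emoji = match.group()
        if emoji not in ids:
            # Raises IndexError if there are more distinct emojis than ids.
            ids[emoji] = entities[len(ids)]
        return f"[{emoji}](emoji/{ids[emoji]})"

    # One pass over the text, so already-inserted links are never re-matched.
    return EMOJI_PATTERN.sub(wrap, text)

KEYCAP = "#\uFE0F\u20E3"  # the keycap emoji, spelled out as code points
text = f"Hello, {KEYCAP} user {KEYCAP} How's your day going? 😄 Great! 👌 Bye 🫰"
entities = [12345, 67890, 23456, 78901]  # placeholder ids
print(link_emojis(text, entities))
```

Because the callback runs once per match in the original text, the replacement strings are never scanned again, which also rules out the nesting problem from the question.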