How can I replace emojis with text and treat them as single words?


I need to do topic modeling in R on pieces of text that contain emojis. The replace_emoji() and replace_emoticon() functions let me analyze them, but there is a problem with the results.

A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.

Terms like "heart" can have very different meanings, as can be seen with "red heart ufef" and "broken heart". The function replace_emoji_identifier() doesn't help either, as the identifiers make the analysis hard.

Dummy data set, reproducible with dput() (including the step that forces lowercase):

Emoji_struct <- c(
      list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
      list(content = "😍😍", "😊 thanks for helping",  "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)

Current code (data_orig is a list of several files):

library(textclean)
library(tm) # provides stripWhitespace(); everything else below is textclean or base R

#pre-processing:
data <- gsub("'", "", data) 
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data)  #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data) 
data <- gsub("[[:digit:]]", "", data)  #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)

Desired output:

[1] list(content = c("fire fire wow", 
                     "facewithopenmouth look at that", 
                     "facewithsteamfromnose this makes me angry facewithsteamfromnose", 
                     "smilingfacewithhearteyes redheart \ufe0f, i love it!"), 
         content = c("smilingfacewithhearteyes smilingfacewithhearteyes", 
                     "smilingfacewithsmilingeyes thanks for helping", 
                     "cryingface oh no, why? cryingface", 
                     "careful, challenging crossmark crossmark crossmark"))

Any ideas? Lowercase output would work, too. Best regards. Stay safe. Stay healthy.


Solution

  • Answer

    Replace the default conversion table in replace_emoji with a version in which the spaces and punctuation are removed:

    library(textclean)

    hash2 <- lexicon::hash_emojis
    hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)

    # Emoji_struct is a plain list, so apply the function to each element
    lapply(Emoji_struct, replace_emoji, emoji_dt = hash2)
    

    Example

    Single character string:

    replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
    #[1] "wow! facewithopenmouth that is cool!"
    

    Character vector:

    replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
    #[1] "1: smilingfacewithsmilingeyes "
    #[2] "2: smilingfacewithhearteyes "
    

    List:

    library(magrittr) # provides %>%
    list("list_element_1: 🔥", "list_element_2: ❌") %>%
      lapply(replace_emoji, emoji_dt = hash2)
    #[[1]]
    #[1] "list_element_1: fire "
    #
    #[[2]]
    #[1] "list_element_2: crossmark "
    

    Rationale

    To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):

    head(lexicon::hash_emojis)
    #              x                        y
    #1: <e2><86><95>            up-down arrow
    #2: <e2><86><99>          down-left arrow
    #3: <e2><86><a9> right arrow curving left
    #4: <e2><86><aa> left arrow curving right
    #5: <e2><8c><9a>                    watch
    #6: <e2><8c><9b>           hourglass done
    

    This is an object of class data.table. We can simply modify the y column of this hash table to remove all spaces and punctuation, so that each emoji name becomes a single token. Note that this also allows you to add new rows, each pairing an ASCII byte representation with an accompanying string.
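
The effect of that substitution can be illustrated without textclean installed. The following is a minimal base-R sketch, not the package's own code: `toy_hash` is a hypothetical two-row stand-in for `lexicon::hash_emojis` (a plain data.frame rather than a data.table), and the use of `iconv(..., sub = "byte")` to produce the `<xx>` byte format for a new row is an assumption based on the format shown in the `head()` output above.

```r
# Toy stand-in for lexicon::hash_emojis (hypothetical; two rows copied from the
# head() output above): x holds the emoji's bytes as <xx>, y holds its name.
toy_hash <- data.frame(
  x = c("<e2><86><95>", "<e2><8c><9b>"),
  y = c("up-down arrow", "hourglass done"),
  stringsAsFactors = FALSE
)

# Before: each name splits into more than one token on whitespace.
tokens_before <- lengths(strsplit(toy_hash$y, "[[:space:]]+"))

# Same substitution as in the answer: drop spaces and punctuation from the names.
toy_hash$y <- gsub("[[:space:]]|[[:punct:]]", "", toy_hash$y)

# After: each name is a single token.
tokens_after <- lengths(strsplit(toy_hash$y, "[[:space:]]+"))

# Adding an entry: iconv(..., sub = "byte") writes non-ASCII bytes as <xx>.
# "\U0001F525" is the fire emoji; the exact result may depend on the platform's iconv.
fire_bytes <- iconv("\U0001F525", from = "UTF-8", to = "ASCII", sub = "byte")
toy_hash <- rbind(
  toy_hash,
  data.frame(x = fire_bytes, y = "fire", stringsAsFactors = FALSE)
)

print(tokens_before)
print(tokens_after)
print(toy_hash)
```

With the real table, the same idea would presumably mean binding a new row onto hash2 before passing it as emoji_dt; that step is untested here and should be checked against the installed versions of textclean and lexicon.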