r unicode emoji data-cleaning sentiment-analysis

Replace Emojis in R with replace_emoji() function does not work due to different encoding - UTF8/Unicode?


I am trying to clean my text data and replace Emojis with words so that I can perform a sentiment analysis later on.

Therefore, I am using the replace_emoji function from the textclean package. This should replace all emojis with their corresponding words.

The dataset I am working with is a text corpus, which is why I use the VCorpus function from the tm package in my sample code below:

library(tm)
library(textclean)
library(lexicon)

text <- "text goes here bla bla <u+0001f926><u+0001f3fd><u+200d><u+2640><u+fe0f>" # text with emojis

text.corpus <- VCorpus(VectorSource(text)) # transform into a corpus
text.corpus <- tm_map(text.corpus, content_transformer(function(x) replace_emoji(x, emoji_dt = lexicon::hash_emojis))) # should replace emojis with words

inspect(text.corpus[[1]]) # inspecting the corpus shows that the Unicode was NOT replaced with words

head(hash_emojis) # shows that the encoding in the lexicon differs from the encoding in my text data

The function itself runs without error, but it does not replace the emojis in my text, because the encoding used in the "hash_emojis" dataset appears to differ from the one in my data. I also tried converting the "hash_emojis" data with the iconv function, but did not manage to change the encoding.
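
To make the mismatch concrete, here is a minimal base-R sketch (the variable names are illustrative and not part of the original code). It assumes the data contains the literal ASCII placeholder text `<u+...>` rather than actual emoji characters, which is what the sample string above suggests.

```r
# The placeholder in the data is plain ASCII text, not an emoji character.
placeholder <- "<u+0001f926>"
has_placeholder <- grepl("<u\\+[0-9a-f]{1,8}>", placeholder, ignore.case = TRUE)
placeholder_chars <- nchar(placeholder, type = "chars")   # 12 ASCII characters

# A real emoji, written with a Unicode escape, is a single character
# stored as four UTF-8 bytes.
emoji <- "\U0001F926"
emoji_chars <- nchar(emoji, type = "chars")
emoji_bytes <- nchar(emoji, type = "bytes")

print(c(has_placeholder, placeholder_chars, emoji_chars, emoji_bytes))
```

If the first check is TRUE for the real data, the text needs to be decoded into actual characters before any emoji lookup can match.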

I would like to replace the Unicode values shown in my dataset with words.


Solution

  • I found an answer to your question. I will mark this one as a duplicate later today, once you have read my answer.

    Using my example:

    library(stringi)
    library(magrittr)
    
    "text goes here bla bla <u+0001F600><u+0001f602>"  %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{4})>", "\\\\u$1") %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{5})>", "\\\\U000$1") %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{6})>", "\\\\U00$1") %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{7})>", "\\\\U0$1") %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{8})>", "\\\\U$1") %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{1})>", "\\\\u000$1") %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{2})>", "\\\\u00$1") %>% 
      stri_replace_all_regex("<u\\+([[:alnum:]]{3})>", "\\\\u0$1") %>% 
      stri_unescape_unicode() %>% 
      stri_enc_toutf8() %>% 
      textclean::replace_emoji()
    
    [1] "text goes here bla bla grinning face face with tears of joy "
    

    Be careful with the Unicode representation. The example answer has the "U" in upper case; I changed this to a lower-case "u" to match your example.

    To combine everything:

    # create a function to use within tm_map
    unicode_replacement <- function(text) {
      text %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{4})>", "\\\\u$1") %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{5})>", "\\\\U000$1") %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{6})>", "\\\\U00$1") %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{7})>", "\\\\U0$1") %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{8})>", "\\\\U$1") %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{1})>", "\\\\u000$1") %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{2})>", "\\\\u00$1") %>% 
        stri_replace_all_regex("<u\\+([[:alnum:]]{3})>", "\\\\u0$1") %>% 
        stri_unescape_unicode() %>% 
        stri_enc_toutf8()
    }
    
    library(tm)
    library(textclean)
    text.corpus <- VCorpus(VectorSource(text)) #Transforming into corpus
    text.corpus <- tm_map(text.corpus, content_transformer(unicode_replacement))
    text.corpus <- tm_map(text.corpus, content_transformer(function(x) replace_emoji(x, emoji_dt = lexicon::hash_emojis)))  
    
    inspect(text.corpus[[1]]) 
    
    <<PlainTextDocument>>
    Metadata:  7
    Content:  chars: 92
    
    text goes here bla bla <f0><9f><a4><a6><f0><9f><8f><bd><e2><80><8d> female sign <ef><b8><8f>
    

    With your example you get the outcome above. Checking the emoji table, none of your Unicode examples appear in it except for the female sign, but that is a separate issue. If I use my own example, "text goes here bla bla <u+0001F600><u+0001f602>", the outcome is as expected.
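
To see why the example from the question behaves differently, it can help to decode the placeholders and list the individual code points. The sketch below uses base R only; `decode_placeholders` is a hypothetical helper written for this illustration (it is not part of stringi or textclean) and assumes placeholders of the form `<u+XXXX>` or `<U+XXXX>` with one to eight hex digits that denote valid code points.

```r
# Hypothetical helper (not from stringi or textclean): turn "<u+XXXX>" or
# "<U+XXXX>" placeholders into real characters in a single pass.
decode_placeholders <- function(x) {
  m <- gregexpr("<[Uu]\\+[0-9A-Fa-f]{1,8}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    hex <- gsub("^<[Uu]\\+|>$", "", tokens)              # keep only the hex digits
    intToUtf8(strtoi(hex, base = 16L), multiple = TRUE)  # one character per placeholder
  })
  x
}

text <- "text goes here bla bla <u+0001f926><u+0001f3fd><u+200d><u+2640><u+fe0f>"
decoded <- decode_placeholders(text)

# Isolate the emoji part and list the code points it is made of.
emoji_part <- sub("^text goes here bla bla ", "", decoded)
code_points <- sprintf("U+%04X", utf8ToInt(emoji_part))
print(code_points)
# "U+1F926" "U+1F3FD" "U+200D" "U+2640" "U+FE0F"
```

The five code points are a face-palm emoji, a skin-tone modifier, a zero-width joiner, the female sign, and a variation selector. Together they render as one glyph, so a lookup table keyed on single emoji may only recognise some of the parts, which would be consistent with only "female sign" being replaced in the output above.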