Tags: r, regex, data-cleaning, sentiment-analysis, emoticons

replace_emoticon function incorrectly replaces characters within a word - R


I am working in R and using the replace_emoticon function from the textclean package to replace emoticons with their corresponding words:

library(textclean)
test_text <- "i had a great experience xp :P"
replace_emoticon(test_text)

[1] "i had a great e tongue sticking out erience tongue sticking out tongue sticking out "

As seen above, the function works, but it also replaces characters that look like an emoticon even when they are part of a word (for example the "xp" in "experience"). I have tried to find a solution for this and found the following overwritten version of the function, which claims to fix the issue:

replace_emoticon <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
    trimws(gsub(
        "\\s+",
        " ",
        mgsub_regex(x, paste0('\\b\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
    ))
}

replace_emoticon(test_text)

[1] "i had a great experience tongue sticking out :P"

However, while it does solve the issue with the word "experience", it creates a whole new issue: it no longer replaces the ":P", which is an emoticon and should normally be replaced by the function.
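The leading "\\b" seems to be the culprit here: a word boundary in front of a non-word character such as ":" only matches if a word character sits directly before it, and in "xp :P" there is a space before the ":". A quick check with base R's PCRE engine (which also understands \\Q...\\E) illustrates the difference:

# the standalone "xp" still matches, the one inside "experience" does not
grepl("\\b\\Qxp\\E\\b", "experience xp", perl = TRUE)
[1] TRUE
# "\\b" before ":" fails because a space, not a word character, precedes it
grepl("\\b\\Q:P\\E\\b", "xp :P", perl = TRUE)
[1] FALSE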

Furthermore, this error is known to occur with the characters "xp", but I am not sure whether there are other character sequences besides "xp" that also get replaced incorrectly when they are part of a word.
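Since replace_emoticon defaults to lexicon::hash_emoticons as its dictionary, one way to check is to list every entry that consists purely of letters, as those are the only ones that can hide inside a word. A small sketch, assuming the dictionary behaves like a data frame with an x column holding the emoticons (as the code above suggests):

library(lexicon)
# entries made up only of letters are the ones that can occur inside a word
letters_only <- hash_emoticons[grepl("^[A-Za-z]+$", hash_emoticons[["x"]]), ]
letters_only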

Is there a solution to tell the replace_emoticon function to only replace "emoticons" when they are not part of a word?

Thank you!


Solution

  • Wiktor is right, the word boundary check is causing the issue. I have adjusted it slightly in the function below. There is still one issue left: if the emoticon is immediately followed by a word, with no space between the emoticon and the word, it is not replaced. The question is whether that last issue matters. See the examples below.

    Note: I have added this information to the textclean issue tracker.

    replace_emoticon2 <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
      trimws(gsub(
        "\\s+", 
        " ", 
        # only a trailing \b is kept; a leading \b would stop emoticons that
        # begin with a non-word character (such as ":P") from matching
        mgsub_regex(x, paste0('\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
      ))
    }
    
    # works
    replace_emoticon2("i had a great experience xp :P")
    [1] "i had a great experience tongue sticking out tongue sticking out"
    replace_emoticon2("i had a great experiencexp:P:P")
    [1] "i had a great experience tongue sticking out tongue sticking out tongue sticking out"
    
    
    # does not work:
    replace_emoticon2("i had a great experience xp :Pnewword")
    [1] "i had a great experience tongue sticking out :Pnewword"
    

    New function added:

    Based on stringi and the regex-escaping function from Wiktor in this post:

    replace_emoticon_new <- function(x, emoticon_dt = lexicon::hash_emoticons, ...) 
    {
      # escape regex metacharacters so each emoticon is matched literally
      regex_escape <- function(string) {
        gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
      }
    
      # require whitespace in front of the emoticon so that sequences inside a
      # word (e.g. the "xp" in "experience") are left alone
      stringi::stri_replace_all(x, 
                                regex = paste0("\\s+", regex_escape(emoticon_dt[["x"]])),
                                replacement = paste0(" ", emoticon_dt[['y']]),   
                                vectorize_all = FALSE)
    }
    
    test_text <- "Hello :) Great experience! xp :) :P"
    replace_emoticon_new(test_text)
    [1] "Hello smiley Great experience! tongue sticking out smiley tongue sticking out"