Tags: r, regex, data-cleaning, sentiment-analysis, emoticons

replace_emoticon function incorrectly replaces characters within a word - R


I am working in R and using the replace_emoticon function from the textclean package to replace emoticons with their corresponding words:

library(textclean)
test_text <- "i had a great experience xp :P"
replace_emoticon(test_text)

[1] "i had a great e tongue sticking out erience tongue sticking out tongue sticking out "

As seen above, the function works, but it also replaces characters that look like an emoticon even when they are part of a word (for example the "xp" in "experience"). I have tried to find a solution for this and found the following overwritten version of the function, which claims to fix the issue:

replace_emoticon <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
    trimws(gsub(
        "\\s+",
        " ",
        mgsub_regex(x, paste0('\\b\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
    ))
}

replace_emoticon(test_text)

[1] "i had a great experience tongue sticking out :P"

However, while it does solve the issue with the word "experience", it creates a whole new issue: it no longer replaces the ":P", which is an emoticon and should normally be replaced by the function.
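The leading "\\b" seems to be the culprit here: a word boundary in front of a non-word character such as ":" only matches if a word character sits directly before it, and in "xp :P" there is a space before the ":". A quick check with base R's PCRE engine (which also understands \\Q...\\E) illustrates the difference:

# the standalone "xp" still matches, the one inside "experience" does not
grepl("\\b\\Qxp\\E\\b", "experience xp", perl = TRUE)
[1] TRUE
# "\\b" before ":" fails because a space, not a word character, precedes it
grepl("\\b\\Q:P\\E\\b", "xp :P", perl = TRUE)
[1] FALSE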

Furthermore, this error is known to occur with the characters "xp", but I am not sure whether there are other character sequences besides "xp" that also get replaced incorrectly when they are part of a word.
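Since replace_emoticon defaults to lexicon::hash_emoticons as its dictionary, one way to check is to list every entry that consists purely of letters, as those are the only ones that can hide inside a word. A small sketch, assuming the dictionary behaves like a data frame with an x column holding the emoticons (as the code above suggests):

library(lexicon)
# entries made up only of letters are the ones that can occur inside a word
letters_only <- hash_emoticons[grepl("^[A-Za-z]+$", hash_emoticons[["x"]]), ]
letters_only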

Is there a solution to tell the replace_emoticon function to only replace "emoticons" when they are not part of a word?

Thank you!


Solution

  • Wiktor is right, the word boundary check is causing the issue. I have adjusted it slightly in the function below. There is still one issue left: if the emoticon is immediately followed by a word, with no space between the emoticon and the word, it is not replaced. The question is whether that last issue matters. See the examples below.

    Note: I have added this information to the textclean issue tracker.

    replace_emoticon2 <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
      trimws(gsub(
        "\\s+", 
        " ", 
        # only a trailing \b is kept; a leading \b would stop emoticons that
        # begin with a non-word character (such as ":P") from matching
        mgsub_regex(x, paste0('\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
      ))
    }
    
    # works
    replace_emoticon2("i had a great experience xp :P")
    [1] "i had a great experience tongue sticking out tongue sticking out"
    replace_emoticon2("i had a great experiencexp:P:P")
    [1] "i had a great experience tongue sticking out tongue sticking out tongue sticking out"
    
    
    # does not work:
    replace_emoticon2("i had a great experience xp :Pnewword")
    [1] "i had a great experience tongue sticking out :Pnewword"
    

    New function added:

    Based on stringi and the regex-escaping function from Wiktor in this post:

    replace_emoticon_new <- function(x, emoticon_dt = lexicon::hash_emoticons, ...) 
    {
      # escape regex metacharacters so each emoticon is matched literally
      regex_escape <- function(string) {
        gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
      }
    
      # require whitespace in front of the emoticon so that sequences inside a
      # word (e.g. the "xp" in "experience") are left alone
      stringi::stri_replace_all(x, 
                                regex = paste0("\\s+", regex_escape(emoticon_dt[["x"]])),
                                replacement = paste0(" ", emoticon_dt[['y']]),   
                                vectorize_all = FALSE)
    }
    
    test_text <- "Hello :) Great experience! xp :) :P"
    replace_emoticon_new(test_text)
    [1] "Hello smiley Great experience! tongue sticking out smiley tongue sticking out"