string, r, text, gsub, emoticons

Remove punctuation but keep emoticons?


Is it possible to remove all punctuation but keep emoticons such as

:-(

:)

:D

:p

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

Solution

  • Here's an approach that is less sophisticated, and likely slower, than @gagolews's solution. It requires that you supply an emoticon dictionary; you can create your own or use the emoticon data set from the qdapDictionaries package. The idea is to convert each emoticon into placeholder text that can't be mistaken for anything else (I use an EMOTICONREPLACE prefix to ensure this), strip the punctuation with qdap::strip, and then convert the placeholders back into emoticons with mgsub. The code assumes the data frame from the question is stored in dat:

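    library(qdap)
    # Alternative: use the ready-made dictionary from qdapDictionaries
    # reps <- emoticon
    emos <- c(":-(", ":)", ":D", ":p", "X-(")
    reps <- data.frame(seq_along(emos), emos)
    
    # Column 1 holds the placeholders, column 2 the emoticons
    reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
    # Step 1: swap each emoticon for its placeholder
    dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
    # Step 2: strip punctuation (keeping digits and case), then restore the emoticons
    dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
        strip(dat$Temp, digit.remove = FALSE, lower.case = FALSE))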
    

    View it:

    truncdf(left_just(dat[, 3, drop = FALSE]), 50)
    
    ##   Temp                                              
    ## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
    ## 2 You know there is a problem when customer service 
    ## 3 ãããæããããéãããæãããInappropriate announce:-(         
    ## 4 AirAsia your direct debit Maybank payment gateways
    ## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
    ## 6 AirAsia Apart from the slight delay and shortage o
    

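    EDIT: To keep ? and ! as requested, pass the char.keep argument to strip. Run this in place of the second mgsub call above, since it must be applied while the placeholders are still in dat$Temp: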

    dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
        strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
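
If qdap is not available, a similar "keep these characters" effect can be approximated in base R with a negative lookahead. This is a sketch of a swapped-in technique, not how char.keep works internally; the helper name strip_except is hypothetical, and it only handles ASCII punctuation. On its own it does not protect emoticons, so it would take the place of the punctuation-stripping step between the two placeholder substitutions.

```r
# Remove ASCII punctuation except "?" and "!" (hypothetical helper).
# The negative lookahead (?![?!]) skips those two characters; perl = TRUE
# is needed because lookaheads are a PCRE feature.
strip_except <- function(x) {
  gsub("(?![?!])[[:punct:]]", "", as.character(x), perl = TRUE)
}

strip_except("Is it something you are working to fix? Kudos, really!!")
# [1] "Is it something you are working to fix? Kudos really!!"
```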