HTML encode text in R


I am working with Twitter data that I feed into an HTML document. The text often contains special characters, such as emojis, that aren't properly encoded for HTML. For example, the tweet:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥

would become:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥

when fed into an HTML document.

Working manually, I could use a tool like https://www.textfixer.com/html/html-character-encoding.php to encode the tweet so that it looks like:

If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#55357;&#56613; &#55357;&#56613; &#55357;&#56613;

which I could then feed to an HTML document and have the emojis show up. Is there a package or function in R that could take text and HTML-encode it, similarly to the web tool above?


Solution

  • Here's a function that encodes the non-ASCII characters of a single string as numeric HTML entities.

    entity_encode <- function(x) {
      cp <- utf8ToInt(x)                      # code points of one string
      rr <- vector("character", length(cp))
      ucp <- cp > 127                         # TRUE for non-ASCII code points
      rr[ucp] <- paste0("&#", cp[ucp], ";")   # numeric character references
      rr[!ucp] <- intToUtf8(cp[!ucp], multiple = TRUE)  # ASCII kept as-is
      paste0(rr, collapse = "")
    }
    

    Called on your tweet, this returns

    [1] "If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#128293; &#128293; &#128293;"
    

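    for your input. The web tool's output differs only in form: 55357 and 56613 are the UTF-16 surrogate pair for U+1F525, while 128293 is that code point itself. A single reference to the code point is the form HTML expects, so this version should display the emoji correctly.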
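
If you need to encode a whole column of tweets rather than one string, something like the following may help. It is only a sketch built on the same idea as the function above: the name `entity_encode_vec` is made up for illustration, and it assumes the input is a character vector that can be converted to UTF-8. The last few lines also check the arithmetic linking 128293 to the pair 55357 and 56613 that the web tool produced.

```r
# Hypothetical vectorised variant of the encoder (the name is illustrative only).
entity_encode_vec <- function(x) {
  encode_one <- function(s) {
    if (is.na(s)) return(NA_character_)       # pass missing values through
    cp <- utf8ToInt(enc2utf8(s))              # code points of this string
    non_ascii <- cp > 127
    out <- intToUtf8(cp, multiple = TRUE)     # one element per character
    out[non_ascii] <- paste0("&#", cp[non_ascii], ";")
    paste0(out, collapse = "")
  }
  vapply(x, encode_one, character(1), USE.NAMES = FALSE)
}

tweets <- c("plain text", "discourse will be \U0001F525")
encoded <- entity_encode_vec(tweets)
print(encoded)

# UTF-16 surrogate pair for the same code point, computed by hand.
cp <- utf8ToInt("\U0001F525")
offset <- cp - 0x10000
high <- 0xD800 + offset %/% 0x400
low  <- 0xDC00 + offset %% 0x400
print(c(code_point = cp, high = high, low = low))
```

Like the function above, this leaves `&`, `<` and `>` untouched, so text containing those characters would still need separate escaping before going into HTML.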