I am working with Twitter data that I then feed into an HTML document. The text often contains special characters, such as emojis, that aren't properly encoded for HTML. For example, the tweet:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥
would become:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥
when fed into an HTML document.
Working manually, I could use a tool like https://www.textfixer.com/html/html-character-encoding.php to encode the tweet so that it looks like:
If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#55357;&#56613; &#55357;&#56613; &#55357;&#56613;
which I could then feed into an HTML document and have the emojis show up. Is there a package or function in R that can take text and HTML-encode it, similarly to the web tool above?
Here's a function that encodes non-ASCII characters as numeric HTML entities. It expects a single string, because utf8ToInt() is not vectorised.
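    entity_encode <- function(x) {
      # Unicode code points of the (single) input string
      cp <- utf8ToInt(x)
      rr <- vector("character", length(cp))
      # Everything above 127 is outside ASCII and gets a numeric entity
      ucp <- cp > 127
      rr[ucp] <- paste0("&#", as.character(cp[ucp]), ";")
      # ASCII code points are copied through unchanged
      rr[!ucp] <- sapply(cp[!ucp], function(z) rawToChar(as.raw(z)))
      paste0(rr, collapse = "")
    }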
For your input, this returns
[1] "If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be &#128293; &#128293; &#128293;"
The numbers differ from the web tool's output, but those seem to be equivalent encodings.
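To see why two different sets of numbers can stand for the same emoji, here is a minimal sketch. It assumes the web tool writes each emoji as a UTF-16 surrogate pair (two numbers per emoji), whereas utf8ToInt() gives the single Unicode code point. That is my assumption about the tool, not something I have confirmed, and the helper names to_surrogates and from_surrogates are made up for illustration.

```r
# Hypothetical helpers for illustration; they are not part of the function above.

# Split a code point above U+FFFF into its UTF-16 surrogate pair.
to_surrogates <- function(cp) {
  stopifnot(cp >= 0x10000, cp <= 0x10FFFF)
  offset <- cp - 0x10000
  c(high = 0xD800 + offset %/% 0x400,
    low  = 0xDC00 + offset %% 0x400)
}

# Recombine a surrogate pair into the code point it stands for.
from_surrogates <- function(high, low) {
  0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
}

fire <- utf8ToInt("\U0001F525")   # the fire emoji, code point 128293
pair <- to_surrogates(fire)
print(pair)                        # high 55357, low 56613
from_surrogates(pair[["high"]], pair[["low"]]) == fire   # TRUE
```

So a pair like 55357 and 56613 carries the same information as the single value 128293. If you have the choice, the single code point form is the shorter of the two.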