
Convert Unicode characters in italic or bold to normal characters using R


I have this string:

string <- "Blah blah \U0001d617\U0001d622\U0001d63a\U0001d633\U0001d630\U0001d62d\U0001d62d \U0001d61a\U0001d631\U0001d626\U0001d624\U0001d62a\U0001d622\U0001d62d\U0001d62a\U0001d634\U0001d635 blah blah"

When I pass it to cat, I get this:

cat("Blah blah \U0001d617\U0001d622\U0001d63a\U0001d633\U0001d630\U0001d62d\U0001d62d \U0001d61a\U0001d631\U0001d626\U0001d624\U0001d62a\U0001d622\U0001d62d\U0001d62a\U0001d634\U0001d635 blah blah")
> Blah blah 𝘗𝘢𝘺𝘳𝘰𝘭𝘭 𝘚𝘱𝘦𝘤𝘪𝘢𝘭𝘪𝘴𝘵 blah blah

How can I convert the string into this:

> "Blah blah Payroll Specialist blah blah" 

I have looked at the post "R: Replacing foreign characters in a string", but I couldn't make it work.

The problem arises when I pull data from a web service, so ideally I am looking for a solution that handles many or all of the possible ways the letters can be represented (e.g. bold, italic, etc.).

Thanks!


Solution

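  • The stringi package (install.packages("stringi")) provides the stri_trans_nf* functions, which perform or check Unicode normalization. The compatibility forms NFKC and NFKD replace styled letters such as these with their plain equivalents; see the Unicode normalization forms documentation for the theory.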

    string <- "Blah blah \U0001d617\U0001d622\U0001d63a\U0001d633\U0001d630\U0001d62d\U0001d62d \U0001d61a\U0001d631\U0001d626\U0001d624\U0001d62a\U0001d622\U0001d62d\U0001d62a\U0001d634\U0001d635 blah blah"
    library(stringi)
    stri_trans_nfkc(string)  # [1] "Blah blah Payroll Specialist blah blah"
    stri_trans_nfkd(string)  # [1] "Blah blah Payroll Specialist blah blah"
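
Since the question asks for something that copes with several letter styles at once, here is a minimal sketch of how the same normalization might be wrapped for a vector of inputs. The helper name normalize_styled and the sample strings are made up for illustration; only stri_trans_nfkc and stri_enc_isascii come from stringi. It assumes the styled letters have Unicode compatibility mappings, which holds for the mathematical alphanumeric and fullwidth blocks but not necessarily for every decorative character a web service might return.

```r
library(stringi)

# Hypothetical helper: fold styled Unicode letters to their plain forms
# using compatibility normalization (NFKC).
normalize_styled <- function(x) {
  stri_trans_nfkc(x)
}

# Illustrative inputs in three different styles.
x <- c(
  "\U0001d617\U0001d622\U0001d63a\U0001d633\U0001d630\U0001d62d\U0001d62d",  # sans-serif italic
  "\U0001d401\U0001d428\U0001d425\U0001d41d",                                # mathematical bold
  "\uff26\uff55\uff4c\uff4c"                                                 # fullwidth
)

plain <- normalize_styled(x)
print(plain)  # [1] "Payroll" "Bold"    "Full"

# Flag any element that still contains non-ASCII characters afterwards,
# so leftovers can be inspected rather than silently passed on.
leftover <- !stri_enc_isascii(plain)
print(leftover)
```

Note that NFKC leaves legitimately accented letters such as "é" in place, so a TRUE in leftover does not by itself mean the normalization failed; it only marks strings worth a closer look.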