Can I convert Unicode into plain text in R?


The data I am using contains many escaped characters of the form "<U+XXXX>". A typical data point looks like this: "<U+043E><U+043A><U+0430><U+0437><U+044B>: 673".

I would like to know what I should use to convert them easily and reliably into ordinary plain text. My table has many rows containing these Unicode escapes, and I am not sure how to handle them.

I have looked for conversion methods online, but most of them don't work. For example, I tried this code to convert my data from UTF-8 to Latin-1, and it left the string unchanged.

library(magrittr)  # provides the %>% pipe
www <- c("<U+043C>")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "<U+043C>"

I have also tried this without the angle brackets. It still doesn't convert.

www <- c("U+043C")
www %>% iconv(from = "UTF-8", to = "latin1")
[1] "U+043C"

Alternatively, I tried calling iconv() directly on a longer string.

example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
iconv(example, "UTF-8", "latin1")
[1] "<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025"

Any ideas, folks?


Solution

  • When you type "<U+043C>", R treats it as a literal string of 8 characters. Whether this string is interpreted as Latin-1 or UTF-8 doesn't matter, since both encode these 8 ASCII characters the same way, which is why iconv() returns it unchanged.

    What you need to do is unescape the Unicode escape sequences. The stringi package can do this for you, but the strings first need to be converted into the "\uXXXX" format it expects. The following function should take care of it:

    
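    f <- function(x) {
      # Turn each "<U+XXXX>" into "\uXXXX": replace "<U+" with "\u", then drop every ">"
      x <- gsub(">", "", gsub("<U\\+", "\\\\u", x))
      # Convert the "\uXXXX" escapes into the actual characters
      stringi::stri_unescape_unicode(x)
    }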
    

    So you can do:

    example <- c("<U+041F><U+043E><U+043A><U+0430><U+0437><U+044B>: 58025")
    www <- c("<U+043C>")
    
    f(example)
    #> [1] "Показы: 58025"
    
    f(www)
    #> [1] "м"
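
One caveat with the function above is that it removes every ">" in the input, and it appears to assume four hex digits per token. As a hedged alternative, here is a minimal base-R sketch that rewrites only well-formed "<U+XXXX>" tokens (4 to 8 hex digits) and leaves everything else untouched. The helper name unescape_u_tokens is hypothetical; it is not part of stringi or of the answer above.

```r
# Hypothetical helper (illustrative name): replaces each "<U+XXXX>" token
# with the character it stands for, using base R only.
unescape_u_tokens <- function(x) {
  pattern <- "<U\\+[0-9A-Fa-f]{4,8}>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the leading "<U+" and the trailing ">" to get the hex digits
    hex <- substr(tokens, 4, nchar(tokens) - 1)
    # Convert each hex code point to its UTF-8 character
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

# The escaped form is 8 literal characters; the real character is 1
nchar("<U+043C>")
nchar(unescape_u_tokens("<U+043C>"))

# A ">" that is not part of a token is preserved
unescape_u_tokens("a > b <U+043C>")
```

Because the pattern matches the whole token, text such as "a > b" keeps its ">" and only the escaped code points are converted.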