Decoding Cyrillic string in R


I would like to decode this string in R: РѕР±РµР·РїРµС‡РµРЅ. The desired output should be: обезпечен

This site suggests that the source encoding is UTF-8 and that it should be transcoded to Windows-1251. I tried the following, with no success:

> word <- "РѕР±РµР·РїРµС‡РµРЅ"
> iconv(word, from = "UTF-8",to = "Windows-1251")
[1] "обезпечен"

Solution

  • These steps seem to do the trick

    word <- "РѕР±РµР·РїРµС‡РµРЅ"
    
    xx <- iconv(word, from="UTF-8", to="cp1251")
    Encoding(xx) <- "UTF-8"
    xx
    # [1] "обезпечен"
    
    target <- "обезпечен"
    xx == target
    # [1] TRUE
    

    So it seems that at some point the bytes that make up the UTF-8 target value were misinterpreted as cp1251, and a process then converted those bytes to UTF-8 using the cp1251->UTF-8 mapping rules. When you run that conversion on data that isn't really cp1251-encoded, you get weird values:

    iconv(target, from="cp1251", to="UTF-8")
    # [1] "РѕР±РµР·РїРµС‡РµРЅ"
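
To check the explanation above at the byte level, here is a minimal sketch that garbles the target word and then repairs it. The variable names `garbled` and `fixed` are illustrative only, and the word is written with `\u` escapes so that the result does not depend on the encoding of the script file or on the session locale. It assumes the local iconv build recognises the name "cp1251", as in the answer above.

```r
# Target word "обезпечен", spelled with \u escapes so R stores it as UTF-8
# regardless of the source file encoding or locale.
target <- "\u043e\u0431\u0435\u0437\u043f\u0435\u0447\u0435\u043d"

# Reproduce the damage: treat the UTF-8 bytes as if they were cp1251
# and convert them to UTF-8.
garbled <- iconv(target, from = "cp1251", to = "UTF-8")

# Undo it: map every character back to its single cp1251 byte ...
fixed <- iconv(garbled, from = "UTF-8", to = "cp1251")
# ... and declare that those bytes are really UTF-8.
Encoding(fixed) <- "UTF-8"

nchar(target)        # 9 characters, 2 bytes each in UTF-8
nchar(garbled)       # 18: every original byte became its own character
charToRaw(target)    # the 18 original bytes
charToRaw(fixed)     # the same 18 bytes after the round trip
identical(charToRaw(fixed), charToRaw(target))
```

The round trip from `garbled` to `fixed` is lossless here because every byte in this word has a defined character in cp1251. A byte with no cp1251 mapping would make `iconv()` return `NA` for that string instead.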