gsub in R with Unicode replacement gives different results under Windows compared with Unix?


Running the following commands in R on Mac or Linux produces the expected result, the Greek letter beta:

gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")

"\u03B2"

However, running the first command on Windows produces the wrong result, while the second gives the correct beta output. I tried three versions of R on Windows (3.0.2, 3.1.1, and 3.1.2), and all of them consistently printed the "wrong" result. (I cannot post the output, as I don't have access to Windows right now.)

In addition, is it possible to convert Unicode escapes from the format < U+FFFF> (ignore the space; without it the website doesn't display anything) to "\uFFFF" using gsub?

Thank you very much.

UPDATE:

Borrowing from MrFlick's solution, I hacked together the following to handle several Unicode escapes in one sentence. The fix is ugly, so feel free to post improvements.

test.string <- "This is a <U+03B1> <U+03B2> <U+03B2> <U+03B3> test <U+03B4> string."

trueunicode.hack <- function(string){
    m <- gregexpr("<U\\+[0-9A-F]{4}>", string)
    if(-1==m[[1]][1])
        return(string)

    codes <- unlist(regmatches(string, m))
    replacements <- codes
    N <- length(codes)
    for(i in 1:N){
        replacements[i] <- intToUtf8(strtoi(paste0("0x", substring(codes[i], 4, 7))))
    }

    # if the string doesn't start with a Unicode escape, copy its initial
    # part up to the first escape
    if(1!=m[[1]][1]){
        y <- substring(string, 1, m[[1]][1]-1)
        y <- paste0(y, replacements[1])
    }else{
        y <- replacements[1]
    }

    # if the string contains more than one Unicode escape
    if(1<N){
        for(i in 2:N){
            s <- gsub("<U\\+[0-9A-F]{4}>", replacements[i], 
                      substring(string, m[[1]][i-1]+8, m[[1]][i]+7))
            Encoding(s) <- "UTF-8"
            y <- paste0(y, s)
        }
    }

    # get the trailing contents, if any (>= so that a single trailing
    # character after the last escape is not dropped)
    if( nchar(string)>=(m[[1]][N]+8) )
        y <- paste0( y, substring(string, m[[1]][N]+8, nchar(string)) )
    y
}

test.string
trueunicode.hack(test.string)

Results:

"This is a <U+03B1> <U+03B2> <U+03B2> <U+03B3> test <U+03B4> string."
"This is a α β β γ test δ string."

Solution

  • If you're not seeing the right character on Windows, try explicitly setting the encoding:

    x <- gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")
    Encoding(x) <- "UTF-8"
    x
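
One plausible reading of the Windows behaviour is that gsub returns the right UTF-8 bytes but without the UTF-8 encoding mark, so the console interprets them in the native code page. The sketch below (not from the original answer, and not verified on Windows) shows how one might check that by looking at the declared encoding and the raw bytes before setting the mark. The variable name `beta` is only illustrative.

```r
# Build beta from its code point so the example does not depend on how
# this script file itself is encoded.
beta <- intToUtf8(0x03B2)

x <- gsub("<U\\+[0-9A-F]{4}>", beta, "<U+03B2>")

Encoding(x)    # declared encoding mark; "unknown" would explain garbled output
charToRaw(x)   # underlying bytes; UTF-8 beta is ce b2

# Only mark the string when the mark is missing (assumes the bytes
# really are UTF-8, as checked above).
if (Encoding(x) == "unknown") Encoding(x) <- "UTF-8"
x
```

If the bytes are already `ce b2`, only the mark is missing and no conversion is needed, which is why `Encoding<-` is used here rather than a converting function.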
    

    As for replacing all such symbols with Unicode characters, I've adapted this answer to do something similar. Here we build the UTF-8 byte sequence of each character as a raw vector. Here's a helper function:

    trueunicode <- function(x) {
        packuni<-Vectorize(function(cp) {
            bv <- intToBits(cp)
            maxbit <- tail(which(bv!=as.raw(0)),1)
            if(maxbit < 8) {
                rawToChar(as.raw(cp))
            } else if (maxbit < 12) {
                rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:11], as.raw(c(0,1,1))), "raw")))
            } else if (maxbit < 17){
                rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:12], as.raw(c(0,1)), bv[13:16], as.raw(c(0,1,1,1))), "raw")))    
            } else {
               stop("too many bits")
            }
        })
        m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
        codes <- regmatches(x,m)
        chars <- lapply(codes, function(x) {
            codepoints <- strtoi(paste0("0x", substring(x,4,7)))
            packuni(codepoints)
    
        })
        regmatches(x,m) <- chars
        Encoding(x)<-"UTF-8"
        x
    }
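
The bit packing inside packuni follows the standard UTF-8 layout for code points up to 16 bits. As a rough cross-check, the same layout can be written with integer bit operations. This is only a sketch: the helper name `utf8_bytes` is hypothetical, and it ignores surrogates and code points above U+FFFF.

```r
# Hypothetical helper, not part of the answer: encode one code point
# below 0x10000 as UTF-8 bytes using integer bit operations.
utf8_bytes <- function(cp) {
    cp <- as.integer(cp)
    if (cp < 0x80L) {
        bytes <- cp                                                 # 0xxxxxxx
    } else if (cp < 0x800L) {
        bytes <- c(bitwOr(0xC0L, bitwShiftR(cp, 6L)),               # 110xxxxx
                   bitwOr(0x80L, bitwAnd(cp, 0x3FL)))               # 10xxxxxx
    } else if (cp < 0x10000L) {
        bytes <- c(bitwOr(0xE0L, bitwShiftR(cp, 12L)),              # 1110xxxx
                   bitwOr(0x80L, bitwAnd(bitwShiftR(cp, 6L), 0x3FL)),  # 10xxxxxx
                   bitwOr(0x80L, bitwAnd(cp, 0x3FL)))               # 10xxxxxx
    } else {
        stop("code point needs more than 16 bits")
    }
    as.raw(bytes)
}

utf8_bytes(0x03B2)   # ce b2
utf8_bytes(0x2660)   # e2 99 a0
```

For U+03B2 the top five bits go into the lead byte (0xCE) and the low six bits into the continuation byte (0xB2), which matches what the two-byte branch of packuni builds with packBits.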
    

    trueunicode can then be used like this:

    x <- c("beta <U+03B2>", "flipped e <U+018F>!", "<U+2660> <U+2663> <U+2665> <U+2666>")
    trueunicode(x)
    # [1] "beta β"       "flipped e Ə!" "♠ ♣ ♥ ♦"
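
A shorter variant is possible by combining the `regmatches<-` approach above with `intToUtf8`, which the question's own hack already uses. This is a sketch rather than a tested replacement: the name `unescape_unicode` is hypothetical, and it relies on `intToUtf8` returning strings already marked as UTF-8, which has not been checked on the Windows versions mentioned in the question.

```r
# Hypothetical variant; the name unescape_unicode is not from the thread.
unescape_unicode <- function(x) {
    m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
    regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
        # Characters 4 to 7 of "<U+03B2>" are the four hex digits;
        # parse them explicitly as base 16.
        cps <- strtoi(substring(codes, 4, 7), 16L)
        # multiple = TRUE returns one string per code point.
        intToUtf8(cps, multiple = TRUE)
    })
    x
}

unescape_unicode(c("beta <U+03B2>", "no escapes here", "<U+2660> <U+2663>!"))
```

Elements without any escape pass through unchanged, and the vectorised replacement avoids the manual index arithmetic needed in the loop-based version.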