
unable to translate '...' to a wide string


It looks to me like R introduced a new error in version 4.3.0, which breaks a lot of my web-scrapers. I only found one mention of the change, but don't really understand the blog post.

In essence, this code fails on newer versions of R, but older versions do some internal conversion that seems to work:

text <- "\xa0 x"
gsub("x", "u", text)
#> Warning in gsub("x", "u", text): unable to translate '<a0> x' to a wide string
#> Error in gsub("x", "u", text): input string 1 is invalid

Created on 2023-07-13 with reprex v2.0.2

Is there any way to remove these special characters before doing string operations? Note that I do not know which characters specifically fail, since the real strings I'm working with are too long to check.


Solution

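  • It's an encoding issue: text is not a valid string in the session's encoding (presumably UTF-8), because the lone byte 0xA0 is not a valid UTF-8 sequence. Non-ASCII characters are fine in themselves, as long as they are validly encoded.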

    Conversion to UTF-8:

    text_utf8 <- iconv(text, from = "ISO-8859-1", to = "UTF-8")
    gsub("x", "u", text_utf8)
    

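    will produce a no-break space (U+00A0) followed by ' u', because byte 0xA0 is the no-break space in ISO-8859-1. This only gives the right result if the input really is ISO-8859-1 (Latin-1).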

    R 4.3.0 changelog says: "Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8)."
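Since you don't know which strings fail, one option is to check validity up front with base R's validUTF8() and convert only the offending elements. A minimal sketch, assuming the invalid strings are really ISO-8859-1; the vector pages is a hypothetical stand-in for your scraped text:

```r
# Hypothetical scraped strings: one valid, one with a stray 0xA0 byte.
pages <- c("plain x", "\xa0 x")

# validUTF8() is FALSE for elements that are not valid UTF-8.
ok <- validUTF8(pages)
which(!ok)
#> [1] 2

# Assumption: the invalid elements are ISO-8859-1; convert only those.
pages[!ok] <- iconv(pages[!ok], from = "ISO-8859-1", to = "UTF-8")

# Every element is now valid, so gsub() no longer errors.
result <- gsub("x", "u", pages)
```

If the source encoding is something other than Latin-1, the from argument needs to change accordingly; validUTF8() only tells you that a string is invalid, not what it was meant to be.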

    You could also treat the input as a sequence of bytes; the invalid byte is then preserved unchanged in the output.

    gsub("x", "u", text, useBytes = TRUE)
    

    gives '\xa0 u'
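If you would rather drop the offending bytes altogether, as the question asks, a commonly used idiom is to run the text through iconv() from UTF-8 to UTF-8 with sub = "", which replaces every non-convertible byte with an empty string. This is a sketch rather than a guaranteed fix: how iconv() treats invalid input can depend on the platform's iconv implementation, and any characters that were encoded in another encoding are lost, not repaired.

```r
text <- "\xa0 x"

# Replace bytes that are not valid UTF-8 with "" (i.e. drop them).
# Assumption: the platform's iconv reports the invalid byte, so that
# R can apply `sub`; check the result with validUTF8() to be sure.
clean <- iconv(text, from = "UTF-8", to = "UTF-8", sub = "")

stopifnot(validUTF8(clean))
out <- gsub("x", "u", clean)
```

Using sub = "byte" instead of sub = "" keeps a visible marker such as <a0> in place of each dropped byte, which can help when tracking down where the bad input came from.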