R - Remove Unicode replacement character from a string


I have a dataset of a hundred million rows, about 10 of which contain some sort of Unicode replacement character. The text representation of this particular character is "<U+FFFD>", but there are others, too.

I want to remove the character, but I wasn't able to come up with a way to do that.

str <- "торгово производственн��я компания"
gsub("<U+FFFD>", "", str)
"торгово производственн��я компания"

If I need to provide any additional info, please let me know. I would also be very grateful for an explanation of what exactly is happening here (why a plain gsub doesn't work and why the character displays like that).


Solution

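  • You are using gsub with a regex pattern as the first argument. In a regex, + is a quantifier, so the pattern <U+FFFD> matches <, one or more U characters, and then the character sequence FFFD>. Your string contains the single character U+FFFD, not the literal text <U+FFFD>, so nothing matches.

    The pattern would only match input like this: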

    > str2 <- "торгово <UUUFFFD> производственн��я компания"
    > gsub("<U+FFFD>", "", str2)
    [1] "торгово  производственн��я компания"
    

    To remove the character itself, use a literal (fixed) string replacement with the \uFFFD escape:

    > str <- "торгово производственн��я компания"
    > gsub("\uFFFD", "", str, fixed=TRUE)
    [1] "торгово производствення компания"
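
The points above can be put together in one short sketch. It assumes a UTF-8 session. The variable names (x, y, clean and so on) and the sample string y are illustrative only and do not come from the question. The string is built with explicit \uFFFD escapes so the result does not depend on how the page rendered the original characters.

```r
# Sample string with two U+FFFD replacement characters, written as escapes.
x <- "торгово производственн\uFFFD\uFFFDя компания"

# 1. Remove the character itself with a fixed (non-regex) replacement.
clean <- gsub("\uFFFD", "", x, fixed = TRUE)

# 2. Hypothetical case: the data holds the literal eight-character text
#    "<U+FFFD>" rather than the character. Either match it as fixed text
#    or escape the "+" so it is no longer a quantifier.
y <- "abc<U+FFFD>def"
literal_fixed <- gsub("<U+FFFD>", "", y, fixed = TRUE)  # "abcdef"
literal_regex <- gsub("<U\\+FFFD>", "", y)              # "abcdef"

# 3. Confirm which code point the escape stands for.
code_point <- utf8ToInt("\uFFFD")                       # 65533, i.e. 0xFFFD

print(clean)
print(literal_fixed)
print(literal_regex)
print(code_point)
```

With fixed = TRUE the pattern is never interpreted as a regex, so no escaping is needed in either case. The name x is used instead of str only to avoid masking the base function str().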