I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text-entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in the tests below that gsub() and grepl() can identify "�" in both a list and a data frame, but when I try to use the same commands on the real data, both commands fail to identify "n�o" and even "�". There is no error; gsub() simply fails to substitute, and grepl() returns FALSE where it should be TRUE.
Are there multiple types of � based on the underlying character? Is there some way to search for or replace � characters that will pick up any instance?
This example shows that gsub() and grepl() both work fine on a list or data frame:
list <- c("n�o ç não", "n�o", "nao", "não")
gsub("�", "ã", list)
grepl("�", list)

library(dplyr)
df <- data.frame(list)
df.new <- df %>%
  mutate(
    sub = gsub("�", "ã", df$list),
    replace = grepl("�", list)
  )
df.new$sub
df.new$replace
[1] "não ç não" "não" "nao" "não"
[1] TRUE TRUE FALSE FALSE
[1] "não ç não" "não" "nao" "não"
[1] TRUE TRUE FALSE FALSE
This same code fails to identify "�" in my real data.
My guess is that you are on a Windows machine, which sometimes doesn't play nicely with Unicode characters. To re-create the problem, I'm parsing your actual post to show you what you can do. I suggest using the stringi library and stri_replace_all_regex() to replace all of the characters that you know should be ã as a short-cut, but really you'd want to handle each possible case individually rather than rely on a blanket replacement (there's a sketch of that at the end of this answer). Check out ?stringi-search-charclass for more info on how to do this. From your original post:
I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in tests below that gsub() and grepl() can identify "�" in both a list or data frame, but when I try to use the same commands on the the real data, both commands fail to identify "n�o" and even "�". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().
We get:
library(xml2)
library(stringi)
library(magrittr)  # provides the %>% pipe

this_post <- "https://stackoverflow.com/questions/66540384/identifying-unicode-replacement-characters-ufffd-or-or-black-diamond-questio#66540384"

read_html(this_post) %>%
  xml_find_all('//*[@id="question"]/div/div[2]/div[1]/p[1]') %>%
  xml_text() %>%
  stri_replace_all_regex("\\p{So}", "ã")
I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with ã. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "não". I can see in tests below that gsub() and grepl() can identify "ã" in both a list or data frame, but when I try to use the same commands on the the real data, both commands fail to identify "não" and even "ã". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().
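If you would rather target the replacement character itself instead of the broad \p{So} class (which matches every "other symbol"), you can match U+FFFD directly. A minimal sketch, using a stand-in vector modeled on the question's example rather than your real data:

library(stringi)

x <- c("n\ufffdo ç não", "n\ufffdo", "nao", "não")  # stand-in for the real responses

# flag entries that still contain the replacement character U+FFFD
stri_detect_fixed(x, "\ufffd")
#  [1]  TRUE  TRUE FALSE FALSE

# replace only U+FFFD, leaving every other non-ASCII character untouched
stri_replace_all_fixed(x, "\ufffd", "ã")
#  [1] "não ç não" "não" "nao" "não"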
Applied to your own data (orig_data below stands in for your real character vector), the same two stringi calls look like this:

# inspect the underlying code points -- the "�" shows up as \ufffd
stringi::stri_escape_unicode(orig_data)

# blanket-replace any "other symbol" character with ã in the original strings
stringi::stri_replace_all_regex(orig_data, "\\p{So}", "ã")
You cannot grepl() for the unknown character directly, because the function has no idea what you are asking it to match. Instead, try this:
stringi::stri_unescape_unicode("\\u00e3")
[1] "ã"
grepl("\u00e3", stringi::stri_escape_unicode(orig_data), perl = TRUE)
[1] TRUE FALSE FALSE TRUE
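To flag the affected survey responses inside a data frame, the same detection step drops into a dplyr pipeline. A minimal sketch with made-up names (a data frame responses with a text column answer); your real columns will differ:

library(dplyr)
library(stringi)

# toy stand-in for the real survey data
responses <- data.frame(answer = c("n\ufffdo", "nao", "não"))

responses.flagged <- responses %>%
  mutate(
    has_bad_char = stri_detect_fixed(answer, "\ufffd"),           # TRUE if � is present
    answer_fixed = stri_replace_all_fixed(answer, "\ufffd", "ã")  # blanket repair to ã
  )

responses.flagged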
Below is a good catch-all solution, as the "question mark" characters you were getting were most likely lost when the text was forced through a non-UTF-8 (ASCII or Latin-1) encoding step. NOTE that, as in the example I gave, you would be replacing ANY/ALL bad characters with "ã". Obviously that isn't a good general approach, but if you read the help docs I'm sure you'll see how to blend it with the escaping trick above so it works for all of your strings.
# assuming orig_data is a data frame with a text column:
# force the text to UTF-8, then blanket-replace any "other symbol" character
orig_data$repaired_text <- stringi::stri_enc_toutf8(orig_data$text) %>%
  stringi::stri_replace_all_regex("\\p{So}", "ã")
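And for the "handle each possible case" route mentioned earlier, here is a hedged sketch (the bad-token lookup below is invented for illustration; build yours from what stri_escape_unicode() shows you). stri_replace_all_fixed() accepts parallel pattern and replacement vectors, so each known broken token can be mapped to its correct Portuguese form:

library(stringi)

# hypothetical lookup: known broken tokens -> intended Portuguese text
broken   <- c("n\ufffdo", "voc\ufffd", "opini\ufffdo")
repaired <- c("não",      "você",      "opinião")

x <- c("n\ufffdo", "voc\ufffd respondeu", "nao")

# vectorize_all = FALSE applies every pattern/replacement pair to every string
stri_replace_all_fixed(x, broken, repaired, vectorize_all = FALSE)
#  [1] "não" "você respondeu" "nao"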