I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text-entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in the tests below that gsub() and grepl() can identify "�" in both a list and a data frame, but when I try to use the same commands on the real data, both commands fail to identify "n�o" and even "�". There is no error; gsub() simply fails to substitute, and grepl() returns FALSE where it should be TRUE.
Are there multiple types of � based on the underlying character? Is there some way to search for or replace � characters that will pick up any instance?
This example shows that gsub() and grepl() both work fine on a list or data frame:
list <- c("n�o ç não", "n�o", "nao", "não")
gsub("�", "ã", list)
grepl("�", list)

library(dplyr)
df <- data.frame(list)
df.new <- df %>%
  mutate(
    sub = gsub("�", "ã", df$list),
    replace = grepl("�", list)
  )
df.new$sub
df.new$replace
[1] "não ç não" "não" "nao" "não"
[1] TRUE TRUE FALSE FALSE
[1] "não ç não" "não" "nao" "não"
[1] TRUE TRUE FALSE FALSE
This same code fails to identify "�" in my real data.
My guess is that you are on a Windows machine, which sometimes doesn't play nicely with Unicode characters. To re-create the problem, I'm parsing your actual post to show you what you can do. I suggest using the stringi library and stri_replace_all_regex() to replace all of the characters that you know should be ã as a short-cut, but really you'd want to handle each possible case individually rather than rely on a blanket replacement (there's a sketch of that at the end of this answer). Check out ?stringi-search-charclass for more info on how to do this. From your original post:
I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with �. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "n�o". I can see in tests below that gsub() and grepl() can identify "�" in both a list or data frame, but when I try to use the same commands on the the real data, both commands fail to identify "n�o" and even "�". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().
We get:
library(xml2)
library(stringi)
library(magrittr)  # provides the %>% pipe

this_post <- "https://stackoverflow.com/questions/66540384/identifying-unicode-replacement-characters-ufffd-or-or-black-diamond-questio#66540384"

read_html(this_post) %>%
  xml_find_all('//*[@id="question"]/div/div[2]/div[1]/p[1]') %>%
  xml_text() %>%
  stri_replace_all_regex("\\p{So}", "ã")
I have data from a large Qualtrics survey that was processed in Stata before I got it. I'm now trying to clean up the data in R. Some of the Portuguese characters have been replaced with ã. I'm trying to flag text entry responses to a series of questions that were originally "não" ["no" in English] and are now recorded as "não". I can see in tests below that gsub() and grepl() can identify "ã" in both a list or data frame, but when I try to use the same commands on the the real data, both commands fail to identify "não" and even "ã". There is no error; it just fails to substitute for gsub() and marks FALSE when it should be TRUE for grepl().
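If you would rather target the replacement character itself instead of the broad \p{So} class (which matches every "other symbol"), you can match U+FFFD directly. A minimal sketch, using a stand-in vector modeled on the question's example rather than your real data:

library(stringi)

x <- c("n\ufffdo ç não", "n\ufffdo", "nao", "não")  # stand-in for the real responses

# flag entries that still contain the replacement character U+FFFD
stri_detect_fixed(x, "\ufffd")
#  [1]  TRUE  TRUE FALSE FALSE

# replace only U+FFFD, leaving every other non-ASCII character untouched
stri_replace_all_fixed(x, "\ufffd", "ã")
#  [1] "não ç não" "não" "nao" "não"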
Applied to your own data (orig_data below stands in for your real character vector), the same two stringi calls look like this:

# inspect the underlying code points -- the "�" shows up as \ufffd
stringi::stri_escape_unicode(orig_data)

# blanket-replace any "other symbol" character with ã in the original strings
stringi::stri_replace_all_regex(orig_data, "\\p{So}", "ã")
You cannot grepl() for the unknown character directly, because the function has no idea what you are asking it to match. Instead, try this:
stringi::stri_unescape_unicode("\\u00e3")
[1] "ã"
grepl("\u00e3", stringi::stri_escape_unicode(orig_data), perl = TRUE)
[1] TRUE FALSE FALSE TRUE
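To flag the affected survey responses inside a data frame, the same detection step drops into a dplyr pipeline. A minimal sketch with made-up names (a data frame responses with a text column answer); your real columns will differ:

library(dplyr)
library(stringi)

# toy stand-in for the real survey data
responses <- data.frame(answer = c("n\ufffdo", "nao", "não"))

responses.flagged <- responses %>%
  mutate(
    has_bad_char = stri_detect_fixed(answer, "\ufffd"),           # TRUE if � is present
    answer_fixed = stri_replace_all_fixed(answer, "\ufffd", "ã")  # blanket repair to ã
  )

responses.flagged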
Below is a good catch-all solution, as the "question mark" characters you were getting were most likely lost when the text was forced through a non-UTF-8 (ASCII or Latin-1) encoding step. NOTE that, as in the example I gave, you would be replacing ANY/ALL bad characters with "ã". Obviously that isn't a good general approach, but if you read the help docs I'm sure you'll see how to blend it with the escaping trick above so it works for all of your strings.
# assuming orig_data is a data frame with a text column:
# force the text to UTF-8, then blanket-replace any "other symbol" character
orig_data$repaired_text <- stringi::stri_enc_toutf8(orig_data$text) %>%
  stringi::stri_replace_all_regex("\\p{So}", "ã")
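And for the "handle each possible case" route mentioned earlier, here is a hedged sketch (the bad-token lookup below is invented for illustration; build yours from what stri_escape_unicode() shows you). stri_replace_all_fixed() accepts parallel pattern and replacement vectors, so each known broken token can be mapped to its correct Portuguese form:

library(stringi)

# hypothetical lookup: known broken tokens -> intended Portuguese text
broken   <- c("n\ufffdo", "voc\ufffd", "opini\ufffdo")
repaired <- c("não",      "você",      "opinião")

x <- c("n\ufffdo", "voc\ufffd respondeu", "nao")

# vectorize_all = FALSE applies every pattern/replacement pair to every string
stri_replace_all_fixed(x, broken, repaired, vectorize_all = FALSE)
#  [1] "não" "você respondeu" "nao"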