I have an R package in which I have a list of university names that I want to match to the user input. The list of names contains special characters and this is generating a warning in R CMD check:
checking data for non-ASCII characters (855ms)
Warning: found non-ASCII strings
Ideally, I would like to convert these non-ASCII Unicode characters to their ASCII-compliant escaped versions to get rid of this warning. Instead of doing it by hand on all of the almost 10k rows, I would rather automate the process from the data-generating script in the data-raw folder.
I think I am really close using stringi::stri_escape_unicode(), but it adds an extra backslash which is hard to get rid of. Here is a reprex with my attempts:
uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")
uni
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"
uni2 <- stringi::stri_escape_unicode(uni)
uni2
#> [1] "Universit\\u00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universit\\u00e9 de Bouak\\u00e9"
#> [3] "Universidad Cat\\u00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"
# gsub removes too many and the special characters are lost
gsub("\\\\", "", uni2)
#> [1] "Universitu00e9 d'Abobo-Adjamu00e9"
#> [2] "Universitu00e9 de Bouaku00e9"
#> [3] "Universidad Catu00f3lica Cardenal Rau00fal Silva Henru00edquez"
# sub removes only the first one so would not work... unless we make it a list!
uni3 <- as.list(uni2)
# But sometimes there is more than one non-ASCII character and the rest get missed...
lapply(uni3, \(x) {
  sub("\\\\", "", x)
}) |> unlist()
#> [1] "Universitu00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universitu00e9 de Bouak\\u00e9"
#> [3] "Universidad Catu00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"
Created on 2023-07-19 with reprex v2.0.2
I feel like the correct approach must be with stringi::stri_encode(), but I did not find the right way to use it yet:
uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")
# Not the expected result
stringi::stri_encode(uni, from = "UTF-8", to = "latin2")
#> [1] "Universit\\xe9 d'Abobo-Adjam\\xe9"
#> [2] "Universit\\xe9 de Bouak\\xe9"
#> [3] "Universidad Cat\\xf3lica Cardenal Ra\\xfal Silva Henr\\xedquez"
stringi::stri_encode(uni, from = "UTF-8", to = "latin1")
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"
Created on 2023-07-19 with reprex v2.0.2
Surely, there is a better way to do this? If only stringi::stri_escape_unicode() had an argument to specify a single backslash, that would work.
The doubled backslashes are there only because R escapes a backslash when it prints a string. The string itself contains a single backslash, so what you have is how it is meant to be. If we write the escaped vector to a CSV file, we can see the doubled backslashes go away:
library(magrittr)  # provides %>%
library(readr)     # provides write_csv()

uni %>%
  stringi::stri_escape_unicode() %>%
  as.data.frame() %>%
  write_csv("test.csv", col_names = FALSE)
# test.csv
Universit\u00e9 d\'Abobo-Adjam\u00e9
Universit\u00e9 de Bouak\u00e9
Universidad Cat\u00f3lica Cardenal Ra\u00fal Silva Henr\u00edquez
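The same point can be checked without writing a file. The following is a minimal sketch, assuming the stringi package is installed; the two-element `uni` vector and the name `esc` are only illustrative. It compares what `print()` displays with what the string actually stores, and confirms that the escaped vector is pure ASCII and can be converted back.

```r
library(stringi)

# Illustrative input; the \u escapes keep this script itself ASCII-only
uni <- c("Universit\u00e9 d'Abobo-Adjam\u00e9",
         "Universit\u00e9 de Bouak\u00e9")

esc <- stri_escape_unicode(uni)

# print() shows R's escaped representation, hence the doubled backslash
print(esc[2])

# writeLines() shows the characters actually stored: one backslash per escape
writeLines(esc[2])

# A stored \uXXXX escape is six characters long, not seven
nchar("\\u00e9")

# Every element of the escaped vector is pure ASCII
all(stri_enc_isascii(esc))

# stri_unescape_unicode() restores the original strings
identical(stri_unescape_unicode(esc), uni)
```

One thing to keep in mind: the escaped data holds the literal characters `\u00e9` rather than `é`, so it would presumably need to go through `stringi::stri_unescape_unicode()` before being compared with user input.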