How to convert a long character vector containing non-ASCII Unicode characters to their escaped versions?


I have an R package in which I have a list of university names that I want to match to the user input. The list of names contains special characters and this is generating a warning in R CMD check:

checking data for non-ASCII characters (855ms)
     Warning: found non-ASCII strings

Ideally, I would like to convert these non-ASCII Unicode characters to their ASCII-compliant escaped versions to get rid of this warning. Instead of doing it by hand for almost 10k rows, I would rather automate the process from the data-generating script in the data-raw folder.

I think I am really close using stringi::stri_escape_unicode(), but it adds an extra backslash which is hard to get rid of. Here is a reprex with my attempts:

uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")
uni
#> [1] "Université d'Abobo-Adjamé"                         
#> [2] "Université de Bouaké"                              
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"

uni2 <- stringi::stri_escape_unicode(uni)
uni2
#> [1] "Universit\\u00e9 d\\'Abobo-Adjam\\u00e9"                             
#> [2] "Universit\\u00e9 de Bouak\\u00e9"                                    
#> [3] "Universidad Cat\\u00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"

# gsub removes too many and the special characters are lost
gsub("\\\\", "", uni2)
#> [1] "Universitu00e9 d'Abobo-Adjamu00e9"                             
#> [2] "Universitu00e9 de Bouaku00e9"                                  
#> [3] "Universidad Catu00f3lica Cardenal Rau00fal Silva Henru00edquez"

# sub removes only the first one so would not work... unless we make it a list!
uni3 <- as.list(uni2)

# But sometimes there is more than one non-ASCII character and the later ones get missed...
lapply(uni3, \(x) {
  sub("\\\\", "", x)
}) |> unlist()
#> [1] "Universitu00e9 d\\'Abobo-Adjam\\u00e9"                             
#> [2] "Universitu00e9 de Bouak\\u00e9"                                    
#> [3] "Universidad Catu00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"

Created on 2023-07-19 with reprex v2.0.2

I feel like the correct approach must be with stringi::stri_encode(), but I have not found the right way to use it yet:

uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")

# Not the expected result
stringi::stri_encode(uni, from = "UTF-8", to = "latin2")
#> [1] "Universit\\xe9 d'Abobo-Adjam\\xe9"                            
#> [2] "Universit\\xe9 de Bouak\\xe9"                                 
#> [3] "Universidad Cat\\xf3lica Cardenal Ra\\xfal Silva Henr\\xedquez"

stringi::stri_encode(uni, from = "UTF-8", to = "latin1")
#> [1] "Université d'Abobo-Adjamé"                         
#> [2] "Université de Bouaké"                              
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"


Surely there is a better way to do this? If only stringi::stri_escape_unicode() had an argument to specify a single backslash, that would work.


Solution

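  • The doubled backslashes are only how R prints the string: print() escapes each backslash, so a single backslash in the data is displayed as \\. The output of stringi::stri_escape_unicode() already contains single backslashes, which is what you want. If we write the escaped vector to a CSV file, we can see the doubled backslashes go away: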

    library(readr)  # provides write_csv()

    # Base pipe (R >= 4.1), as already used in the question
    uni |>
      stringi::stri_escape_unicode() |>
      as.data.frame() |>
      write_csv("test.csv", col_names = FALSE)
    
    # test.csv
    Universit\u00e9 d\'Abobo-Adjam\u00e9
    Universit\u00e9 de Bouak\u00e9
    Universidad Cat\u00f3lica Cardenal Ra\u00fal Silva Henr\u00edquez
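
The same point can be checked without writing a file. The following is a minimal sketch, assuming the stringi package is installed; the variable names `escaped` and `restored` are only illustrative. It compares print() with writeLines(), confirms the escaped vector is pure ASCII, and shows that the original names can be recovered with stringi::stri_unescape_unicode(), for example when the package loads the data.

```r
# Minimal sketch; assumes the stringi package is installed.
uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké")

escaped <- stringi::stri_escape_unicode(uni)

# print() re-escapes each backslash for display, hence the "\\" on screen.
print(escaped[2])

# writeLines() emits the characters as stored: one backslash before u00e9.
writeLines(escaped[2])

# Each "é" (1 character) became "\u00e9" (6 characters): 20 + 2 * 5 = 30.
nchar(escaped[2])

# The escaped vector contains only ASCII characters.
all(stringi::stri_enc_isascii(escaped))

# Unescaping restores the original strings.
restored <- stringi::stri_unescape_unicode(escaped)
identical(restored, uni)
```

Note that the escaped strings are different data from the originals: they hold the literal text \u00e9 rather than the character é, so anything matched against user input would probably need to be unescaped first.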