
Read in file with UTF-8 character in path in R


I have a large number of *.rds files, some of which have UTF-8 characters in their paths. For some reason R can't handle certain accented characters. For example, enc2utf8("Č") should print "Č", but on my machine it is converted to "C", which makes it impossible for R to find the file. Any ideas how to handle such cases or help R with the encoding?

Session info output:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.7.9 here_0.1        forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2     purrr_0.3.4    
 [7] readr_1.3.1     tidyr_1.1.2     tibble_3.0.3    ggplot2_3.3.2   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       cellranger_1.1.0 pillar_1.4.6     compiler_4.0.2   dbplyr_1.4.4     tools_4.0.2     
 [7] jsonlite_1.7.2   lifecycle_1.0.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.10     reprex_0.3.0    
[13] cli_2.4.0        DBI_1.1.0        rstudioapi_0.13  haven_2.3.1      withr_2.4.2      xml2_1.3.2      
[19] httr_1.4.2       fs_1.5.0         generics_0.1.0   vctrs_0.3.3      hms_0.5.3        rprojroot_1.3-2 
[25] neuralnet_1.44.2 grid_4.0.2       tidyselect_1.1.0 glue_1.4.2       R6_2.4.1         readxl_1.3.1    
[31] modelr_0.1.8     blob_1.2.1       magrittr_1.5     backports_1.1.9  scales_1.1.1     ellipsis_0.3.1  
[37] rvest_0.3.6      assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.6    munsell_0.5.0    broom_0.7.0     
[43] crayon_1.3.4   

EDIT I:

Clarification: R can't read the file path because of UTF-8 characters in the file name.

Original file path example : G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe/POLJAŃSKI_Paweł_sprinter_point.rds

Neither readRDS() from base R nor read_rds() from the readr package handles the path correctly.

Both produce the following error:

Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '

G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe/POLJANSKI_Pawel_sprinter_point.rds', probable reason 'No such file or directory'

I don't read the paths from a *.txt file; instead I have a function that builds a list of the files in the given directories.

That function prints the file path correctly, so the problem is not in the way I concatenate the path string.

 str_c(outputDIR_pro[i],
       sub(".+/data/Strava/.+/([0-9]+?).txt", "\\1", athlethes[[i]][[j]]) %>%
         str_match('\\d+') %>%
         str_detect(names_id_vec, .) %>%
         names_id_vec[.] %>%
         str_remove('\\d+;'),
       '_sprinter_point', '.rds') # %>% readRDS
[1] " G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe /POLJAŃSKI_Paweł_sprinter_point.rds"

Solution

  • At first I thought your locale was the problem: Windows-1252 doesn't contain "Ń". But I couldn't reproduce your error, even with file names like "🦄.rds", latin1 encoding, and a German locale.

    But the amount of whitespace in your error message was more than I got for files that didn't exist. Then I spotted the leading space in your example output:

    [1] " G:/Users/SomeUser/Documents/University/2021/Project_M/data/procyclingstats/BORA_hansgrohe /POLJAŃSKI_Paweł_sprinter_point.rds"
    

    That could explain why the path looks fine when printed (the stray whitespace is easy to miss) but fails when you try to read the file. It does leave me puzzled about why your other files read without problems.

    If that isn't the problem, then it may be down to the relatively recent support for UTF-8 on Windows. Historically, Windows has used UCS-2 and UTF-16 internally. "Turning on" UTF-8 support requires a different C runtime. There is an experimental build of R that uses that runtime, which you could try out. But that requires you to rebuild your libraries (readr!) against that runtime too.

    Before messing up your whole R installation, I'd first test whether the experimental build can read a file called Ń.csv.
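
The two checks suggested in this answer could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: has_stray_space(), clean_path() and can_use_name() are hypothetical helper names, the path is copied from the printed output in the question (which, besides the leading space, also shows a space before "/POLJAŃSKI"), and the cleaning step assumes no directory or file name is meant to begin or end with a space. The non-ASCII letters are written as \u escapes so the script does not depend on the encoding of the source file.

```r
# --- Check 1: stray whitespace in the path ---------------------------------
# "\u0143" is "Ń" and "\u0142" is "ł".
raw_path <- paste0(
  " G:/Users/SomeUser/Documents/University/2021/Project_M/data/",
  "procyclingstats/BORA_hansgrohe /POLJA\u0143SKI_Pawe\u0142_sprinter_point.rds"
)

# TRUE if the path starts or ends with whitespace, or has whitespace next to "/".
has_stray_space <- function(path) grepl("^\\s|\\s$|\\s/|/\\s", path)

# Assumption: no path component is meant to begin or end with a space.
clean_path <- function(path) gsub("\\s*/\\s*", "/", trimws(path))

has_stray_space(raw_path)
cleaned <- clean_path(raw_path)
has_stray_space(cleaned)
file.exists(cleaned)  # only meaningful on the machine that holds the files

# --- Check 2: can this R build handle a non-ASCII file name? ---------------
# Writes a small CSV into a fresh temporary directory, checks that the name
# on disk is the name that was requested (not a transliterated one such as
# "N.csv"), and reads the file back. Any error or warning counts as failure.
can_use_name <- function(stem) {
  tmp_dir <- tempfile("name-test-")
  dir.create(tmp_dir)
  on.exit(unlink(tmp_dir, recursive = TRUE), add = TRUE)
  name <- paste0(stem, ".csv")
  path <- file.path(tmp_dir, name)
  tryCatch({
    write.csv(data.frame(x = 1:3), path, row.names = FALSE)
    same_name <- identical(enc2utf8(list.files(tmp_dir)), enc2utf8(name))
    same_name && identical(read.csv(path)$x, 1:3)
  }, error = function(e) FALSE, warning = function(w) FALSE)
}

can_use_name("\u0143")  # the "Ń.csv" test
```

If the first check reports stray whitespace, fixing the code that builds the path is probably a better cure than cleaning the string afterwards. The second check compares the name on disk because a plain write-then-read round trip might succeed even if the name were silently transliterated in both directions.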