
R import of Stata file has problems with French accented characters


I have a large Stata file that I think has some French accented characters that have been saved poorly.

When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but in one variable, and I'm certain in others, French accented characters are not rendered properly. I had a similar problem with another Stata file and I tried to apply the fix (which actually did not work in that case, but seems on point) here.

To be honest, this seems to be the real problem here somehow. A lot of the garbled characters are "actual" and they match up to what is "expected", but I have no idea how to go back.

Reproducible code is here:


library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()

download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")

#Try with encoding set to blank, it won't work. 
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")

unlink(c(temp, temp2))

#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec. 
#I know this occupation has messed up characters
ces19web %>% 
  filter(str_detect(pes19_occ_text,"assembleur-m")) %>% 
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding<-Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>% 
  filter(str_detect(pes19_occ_text,"assembleur-m")) %>% 
  select(cps19_ResponseId, pes19_occ_text, encoding) 
#Write out messy occupation titles
ces19web %>% 
  filter(str_detect(pes19_occ_text,"Ã|©")) %>% 
  select(cps19_ResponseId, pes19_occ_text, encoding) %>% 
  write_csv(file=here("Data/messy.csv"))

#Try to fix

source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy<-ces19web$pes19_occ_text
library(stringi)
#Try to clean with the replacements in fixes (sourced above)
ces19web$pes19_occ_text_cleaned<-stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)

#Examine
ces19web %>% 
  filter(str_detect(pes19_occ_text_cleaned,"Ã|©")) %>% 
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>% 
  head()


Solution

  • Assuming a UTF-8 locale, which can be checked with:

    Sys.getlocale()
    #> [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
    

    At first we had this somewhere and everything was fine:

    utf8 <- "Producteur télé"
    Encoding(utf8)
    #> [1] "UTF-8"
    charToRaw(utf8) # é encoded to c3 a9 as expected for utf-8
    #>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
    utf8
    #> [1] "Producteur télé"
    

    But something bad happened: the string was treated as a latin1 string, in which c3 and a9 are 2 separate characters, "Ã" and "©", and it was wrongly converted from latin1 to UTF-8. So instead of having "é" in UTF-8 (with bytes that read as "Ã©" in latin1), we now have "Ã©" in UTF-8, coded with the 2 characters c3 83 and c2 a9:

    oops <- iconv(utf8, from = "latin1", to = "UTF-8")
    Encoding(oops)
    #> [1] "UTF-8"
    charToRaw(oops)
    #>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 83 c2 a9 6c c3 83 c2 a9
    oops
    #> [1] "Producteur tÃ©lÃ©"
    

    This string is no longer a meaningful UTF-8 or latin1 string: "é" is e9 in latin1, or c3 a9 in UTF-8, but never c3 83 c2 a9!

    We can undo the bad translation though:

    proper_utf8_encoding_with_latin1_marking <- 
      iconv(oops, from = "UTF-8", to = "latin1")
    Encoding(proper_utf8_encoding_with_latin1_marking)
    #> [1] "latin1"
    # c3 a9 is é in utf-8, not in latin1!
    charToRaw(proper_utf8_encoding_with_latin1_marking) 
    #>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
    proper_utf8_encoding_with_latin1_marking
    #> [1] "Producteur tÃ©lÃ©"
    

    From there we can build either a proper utf-8 string (recommended) or a proper latin1 string

    utf8 <- proper_utf8_encoding_with_latin1_marking
    Encoding(utf8) <- "UTF-8"
    Encoding(utf8)
    #> [1] "UTF-8"
    
    charToRaw(utf8)
    #>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 c3 a9 6c c3 a9
    
    utf8
    #> [1] "Producteur télé"
    
    latin1 <- 
      iconv(proper_utf8_encoding_with_latin1_marking, from = "UTF-8", to = "latin1")
    Encoding(latin1)
    #> [1] "latin1"
    charToRaw(latin1) # e9 is é in latin1
    #>  [1] 50 72 6f 64 75 63 74 65 75 72 20 74 e9 6c e9
    latin1
    #> [1] "Producteur télé"
    

    Part of encoding hell is that R sees those MOSTLY as the same, because it mostly doesn't matter

    identical(utf8, latin1)
    #> [1] TRUE
    

    But the truth can be seen with the Encoding() and charToRaw() functions, or when serializing, which shows both pieces of information.

    waldo::compare(
      serialize(utf8, NULL),
      serialize(latin1, NULL)
    )
    #> `old[31:42]`: "01" "00" "00" "80" "09" "00" "00" "00" "11" "50" and 2 more...
    #> `new[31:42]`: "01" "00" "00" "40" "09" "00" "00" "00" "0f" "50" ...          
    #> 
    #> `old[49:56]`: "72" "20" "74" "c3" "a9" "6c" "c3" "a9"
    #> `new[49:54]`: "72" "20" "74" "e9" "6c" "e9"
    

    The 3 differences we see above are the encoding marking (80 for UTF-8, 40 for latin1, 00 for unknown), the length in bytes (11 = 17 in decimal, 0f = 15 in decimal), and the byte values of the "é" characters ("c3" "a9" vs "e9").
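
    As a small sanity check on the byte lengths quoted above, the following sketch rebuilds the two strings and counts characters and bytes directly. The "\u00e9" escape is used only so that the example does not depend on how the script file itself is encoded; that choice is an assumption of this sketch, not part of the original answer.

    ```r
    # Rebuild both strings; "\u00e9" is "é" written as an escape
    utf8   <- "Producteur t\u00e9l\u00e9"
    latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")

    nchar(utf8, type = "chars")    # 15 characters
    nchar(utf8, type = "bytes")    # 17 bytes: each "é" takes 2 bytes (c3 a9)
    nchar(latin1, type = "bytes")  # 15 bytes: each "é" takes 1 byte (e9)

    # The hex lengths from the serialized output, converted to decimal
    strtoi(c("11", "0f"), base = 16L)  # 17 15
    ```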

    Fun fact, if we change the locale to latin1 (here on a Mac), for reasons that I don't understand, oops will actually print "é" (and the others won't print well anymore), proving that we can't always trust print() and identical(), and that charToRaw(), Encoding() and iconv() are your friends to debug encoding hell.

    Sys.setlocale("LC_CTYPE", "en_US.ISO8859-1")
    oops
    #> [1] "Producteur télé"
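
To apply the round trip above to a whole column rather than a single string, something like the following sketch may help. `fix_double_encoding()` is a hypothetical helper name, not a function from haven or stringi, and it assumes the damage is exactly one wrong latin1-to-UTF-8 conversion, as in `oops`. It replaces an element only when the reversed bytes form valid UTF-8, so strings that were already correct should be left alone.

```r
# Hypothetical helper (not from any package): undo one wrong
# latin1 -> UTF-8 conversion, element by element.
fix_double_encoding <- function(x) {
  stopifnot(is.character(x))
  # Reverse the bad conversion; elements that cannot be represented
  # in latin1 come back as NA
  candidate <- iconv(x, from = "UTF-8", to = "latin1")
  # The resulting bytes are really UTF-8, so mark them as such
  Encoding(candidate) <- "UTF-8"
  # Accept the repair only where it yields valid UTF-8
  ok <- !is.na(candidate) & validUTF8(candidate)
  x[ok] <- candidate[ok]
  x
}

# Small self-contained check, using escapes instead of literal accents
good  <- "Producteur t\u00e9l\u00e9"
bad   <- iconv(good, from = "latin1", to = "UTF-8")  # same damage as `oops`
fixed <- fix_double_encoding(c(bad, good, "plain", NA))

identical(charToRaw(fixed[1]), charToRaw(good))  # damaged string repaired
identical(fixed[2], good)                        # correct string untouched
```

On the data from the question this would be called as `ces19web$pes19_occ_text <- fix_double_encoding(ces19web$pes19_occ_text)`. Whether that file is damaged in exactly this way is an assumption, so it is worth inspecting a few known-bad rows with charToRaw() first.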