r elasticsearch utf-8 character-encoding stringi

R String Encoding from "unknown"/"ASCII" to "UTF-8"

I'm not really sure how to make this into a reproducible example, and for that I apologize. I have a data frame with a string column. When I run stri_enc_mark on the column, I see that it contains both 'ASCII' and 'UTF-8' encoded strings. This is an issue because when I try to upload the data into an Elasticsearch database, I get the following error:

"Invalid UTF-8 start byte 0xa0\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@40d00701; line: 1, column: 1425]"

I'm assuming this is because of the ASCII encoded strings. I tried write.csv(..., fileEncoding = 'UTF-8'), but when I load that CSV back in, the string column still has a mix of encodings. None of Encoding(x) <- 'UTF-8', stri_enc_toutf8, or stri_encode seems to help with the conversion.
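For anyone trying to reproduce this, here is a minimal sketch of how the offending strings might be located. The vector x is made-up example data (not my real column), and it relies on base R's validUTF8(), which checks the actual bytes rather than the declared encoding mark. The byte 0xa0 from the error message is a non-breaking space in Latin-1, which is not valid UTF-8 on its own.

```r
# Hypothetical example data: the second string contains the single byte 0xa0,
# which is a non-breaking space in Latin-1 but is not valid UTF-8 by itself.
x <- c("plain ascii", rawToChar(as.raw(c(0x61, 0xa0, 0x62))), "caf\u00e9")

# Declared encoding marks: what R believes, not necessarily what the bytes are.
marks <- Encoding(x)

# TRUE where the bytes form valid UTF-8, regardless of the declared mark.
ok <- validUTF8(x)

# Positions of the strings a strict UTF-8 parser would reject.
bad <- which(!ok)
```

Inspecting x[bad] (for example with charToRaw) should show which rows carry the stray bytes.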

Any advice or guidance would be awesome.

Solution

Thanks to @MrFlick I was able to solve the problem. Essentially, given a data frame with character columns of mixed encodings, the easiest workaround was to:

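library(dplyr)

df %>%
  mutate_if(is.character, function(x) {
    # Round-trip each string through its raw bytes. The bytes are unchanged,
    # but rawToChar() returns a string without a declared encoding mark.
    # USE.NAMES = FALSE avoids attaching the strings themselves as names.
    vapply(x, function(y) rawToChar(charToRaw(y)), character(1),
           USE.NAMES = FALSE)
  })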

This strips the declared encoding mark from every string, so R treats them all as being in the same native encoding (the bytes themselves are not changed). That resolved the issue where I was unable to load the data into Elasticsearch due to encoding inconsistencies.
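If the upload still fails with an invalid-byte error, one alternative is to re-encode the offending strings explicitly. This is only a sketch: to_utf8 is a hypothetical helper, and it assumes the invalid bytes are Latin-1, which is a guess based on the 0xa0 byte in the error and should be checked against where the data actually came from.

```r
# Hypothetical helper: re-encode only the strings whose bytes are not valid
# UTF-8, ASSUMING those bytes are Latin-1 (an assumption; verify your source).
to_utf8 <- function(x, from = "latin1") {
  # NA entries are skipped; everything else is checked byte-wise.
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = from, to = "UTF-8")
  # Convert any remaining native-encoded strings and mark the result as UTF-8.
  enc2utf8(x)
}

# Made-up example: the second element holds a lone 0xa0 byte.
x <- c("plain ascii", rawToChar(as.raw(c(0x61, 0xa0, 0x62))), NA)
y <- to_utf8(x)

# Check that every non-missing string is now valid UTF-8.
all_valid <- all(validUTF8(y[!is.na(y)]))
```

On a data frame this could be applied with mutate_if(is.character, to_utf8), in the same way as the workaround above.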