r elasticsearch utf-8 character-encoding stringi

R String Encoding from "unknown"/"ASCII" to "UTF-8"


I'm not sure how to turn this into a reproducible example, and I apologize for that. I have a data frame with a string column. When I run stri_enc_mark on the column, I see both 'ASCII' and 'UTF-8' encoded strings. This is a problem because when I try to upload the data to an Elasticsearch database, I get the following error:

"Invalid UTF-8 start byte 0xa0\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@40d00701; line: 1, column: 1425]"

I'm assuming this is because of the ASCII encoded strings. I tried write.csv(..., fileEncoding = 'UTF-8'), but when I load that CSV the string column still has a mix of encodings. None of Encoding(x) <- 'UTF-8', stri_enc_toutf8, or stri_encode seems to help with the conversion.

Any advice or guidance would be awesome.


Solution

  • Thanks to @MrFlick I was able to solve the problem. Given a data frame with character columns of mixed encodings, the easiest workaround was:

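    library(dplyr)

    df %>%
      # apply the conversion to every character column
      mutate_if(is.character, function(x) {
        # round-trip each string through its raw bytes; this leaves the
        # bytes untouched but drops any declared encoding mark
        vapply(x, function(y) rawToChar(charToRaw(y)),
               character(1), USE.NAMES = FALSE)
      })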
    

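    Round-tripping each string through its raw bytes drops the declared encoding mark without changing the bytes, so every string in the column ends up being treated as the same native encoding. This solved the issue where I was unable to load the data into Elasticsearch due to encoding inconsistencies.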
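
For readers who want the strings to end up as actual UTF-8 rather than just unmarked, here is a minimal base R sketch of an alternative. It is not the method from the answer above. Plain ASCII is already valid UTF-8, and the byte 0xa0 in the error is a non-breaking space in Latin-1, so the sketch assumes that any string that is not valid UTF-8 is Latin-1 and re-encodes only those strings. The helper name to_utf8 and the sample data frame are hypothetical; if the data came from another encoding, the `from` argument would need to change.

```r
# Hypothetical helper (not from the original answer): return a character
# vector in which every non-NA element holds valid UTF-8 bytes.
# ASSUMPTION: strings whose bytes are not valid UTF-8 are Latin-1.
to_utf8 <- function(x, from = "latin1") {
  # strings whose bytes are not valid UTF-8 (NA is left alone)
  bad <- !is.na(x) & !validUTF8(x)
  # re-encode only the offending strings from the assumed source encoding
  x[bad] <- iconv(x[bad], from = from, to = "UTF-8")
  # the remaining strings already hold valid UTF-8 bytes, so only declare it
  ok <- !is.na(x) & !bad
  Encoding(x[ok]) <- "UTF-8"
  x
}

# Sample data: plain ASCII, a UTF-8 string, and a string containing a raw
# 0xa0 byte like the one Elasticsearch rejected
df <- data.frame(
  id = 1:3,
  txt = c("plain", "caf\u00e9", rawToChar(as.raw(c(0x61, 0xa0, 0x62)))),
  stringsAsFactors = FALSE
)

# Apply the helper to every character column, leaving other columns as-is
df[] <- lapply(df, function(col) if (is.character(col)) to_utf8(col) else col)

# Every string should now pass the UTF-8 validity check
all(validUTF8(df$txt))
```

Checking with validUTF8 before uploading is a cheap way to confirm that no invalid start bytes remain. Note that the Latin-1 assumption cannot be verified from the bytes alone, so it is worth inspecting a few converted strings by eye.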